Classification_of_Fraud_Calls_by_Intent_Analysis_of_Call_Transcripts

This document presents a study on classifying fraudulent phone calls through intent analysis of call transcripts using machine learning techniques. The authors propose a system that analyzes conversations to detect phishing calls, achieving accuracies of 94.57% and 97.21% with Naive Bayes and CNN models, respectively. The research highlights the inadequacies of traditional detection methods that rely on blacklists and emphasizes the need for a more robust approach to combat phone fraud.

IEEE - 51525

CLASSIFICATION OF FRAUD CALLS BY INTENT ANALYSIS OF CALL TRANSCRIPTS
2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT) | 978-1-7281-8595-8/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICCCNT51525.2021.9579632

Neha Kale∗ , Shivangi Kochrekar† , Rishita Mote‡ and Surekha Dholay§


Department of Computer Engineering, Sardar Patel Institute Of Technology
Mumbai, India

Abstract: While the rapid growth of technology makes life easier for consumers, it has also brought new threats along with it. People can be duped into revealing sensitive data such as banking and credit card details, personally identifiable information, and passwords. One such technique is phishing through phone calls. The number of people who fall for such scams is an astounding figure. The conventional approach to detecting phishing call fraud depends on a blacklist of known fraud numbers. This creates a problem when new numbers or numbers which have not been encountered by the system before are used. To solve this, we propose a system that will classify a fraudulent call by analyzing the conversation between the potential victim and the caller. We used various machine learning techniques to perform intent analysis of call transcripts. We created two models and compared them. The models based on the Naive Bayes Algorithm and CNN have an accuracy of 94.57% and 97.21% respectively.

Index Terms: Classification algorithm, CNN architecture, Fraud, Machine learning, Phishing.

I. INTRODUCTION
With the advent of new technologies, banks and other financial institutions, the government, and various other companies are shifting their base online and providing convenient and faster services to their customers. This new avenue however has brought new risks to the money and data of customers. One such threat is phishing phone calls.
Phone call frauds are a cause of severe monetary and personal information loss to a lot of people in India. The elderly and techno-phobic people who are naive and have no experience with these types of scams are more likely to fall prey to these devious crimes. In the financial year 2019-2020, more than 50,000 cases of deceitful usage of internet banking, debit cards and credit cards were reported in India as revealed by the Minister of State for Electronics and IT, Sanjay Dhotre [1]. The money involved in these fraudulent transactions is about Rs 228 crore, which is around Rs 80 crore more than in the financial year 2018-2019.
Table I shows the data on frauds in India as reported by Standard Chartered Banks (SCBs) and First Investment Banks (FIs) in the category ‘Card/Internet - ATM/Debit Cards, Credit Cards and Internet Banking’. The data is for the last three Financial Years and the period up to the quarter which ended in December 2019 of the current Financial Year based on Reporting (Rs. in cr.):

TABLE I. STATISTICS OF FRAUDS FOR THE LAST 3 FINANCIAL YEARS

According to the information reported to and tracked by the Indian Computer Emergency Response Team (CERT-In), a total of 454, 472 and 194 phishing incidents were noticed during the years 2018, 2019 and 2020 (till August) respectively. Further, a total of 6, 4 and 2 financial fraud incidents involving ATMs, Cards, Point-of-Sale (PoS) systems and Unified Payment Interface (UPI) have been reported during the years 2018, 2019 and 2020 (till August) respectively [2].
The conventional approach to detecting phishing call fraud depends on a blacklist of known fraud numbers. However, fraudsters can simply change their number to evade detection. To solve this problem, we propose a system that will classify a fraudulent call by analyzing the conversation between the potential victim and the caller.
In particular, we used data collected through various sources having reports or testimonies of the victims of such crimes. In addition, we circulated a survey to collect data about fraud calls. The survey collected experiences of people who were victims of fraud calls. We used conversational data sets and then combined the fraud and non-fraud data sets. Data analysis and data visualization were then performed to better understand the data we had. The data set was found to be highly skewed since the number of non-fraud calls was much greater than the number of fraud calls. Then we cleaned the data and used various pre-processing techniques on it. After splitting the data set into training and testing data sets, we performed oversampling on the training data set to make the minority and the majority class equal. We created two models based on two different approaches: Naive Bayes and CNN models. Then we compared both these models and evaluated them. Our results are positive and the project would be beneficial to further researchers.
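The split-then-oversample step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_then_oversample`, the 20% test fraction, the seed and the toy transcripts are all assumptions made for the example. It duplicates random minority-class records in the training split only, leaving the test split untouched.

```python
import random
from collections import Counter


def split_then_oversample(texts, labels, test_fraction=0.2, seed=0):
    """Split into train/test, then balance classes in the training split only.

    Hypothetical helper: name, test_fraction and seed are illustrative choices.
    """
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test = [(texts[i], labels[i]) for i in indices[:n_test]]
    train = [(texts[i], labels[i]) for i in indices[n_test:]]

    # Random oversampling: duplicate randomly chosen records of every
    # smaller class until it matches the size of the largest class.
    counts = Counter(label for _, label in train)
    target = max(counts.values())
    balanced = list(train)
    for label, count in counts.items():
        pool = [pair for pair in train if pair[1] == label]
        balanced.extend(rng.choice(pool) for _ in range(target - count))
    rng.shuffle(balanced)
    return balanced, test


# Toy data (assumed): 10 fraud transcripts (label 1) and 90 normal ones (label 0).
texts = [f"transcript {i}" for i in range(100)]
labels = [1] * 10 + [0] * 90
train, test = split_then_oversample(texts, labels)
print(Counter(label for _, label in train))  # both classes now have equal counts
print(len(test))                             # the test split keeps its original size
```

Oversampling after the split keeps duplicated fraud transcripts out of the test set, so the reported metrics are not inflated by records the model has already seen.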

12th ICCCNT 2021


Authorized licensed use limited to: Indian Institute Of Technology Bhilai. Downloaded on November 10,2024 at 14:00:45 UTC from IEEE Xplore. Restrictions apply.
July 6-8, 2021 - IIT - Kharagpur
Kharagpur, India

II. LITERATURE SURVEY
Research has been conducted on email phishing, which is a traditional way of attack. The work of Şentürk et al. [3] is based on machine learning and data mining techniques to detect phishing emails. Phishing through phone calls, wherein the fraudster manipulates the victim to divulge sensitive information during a phone call, is a newer form of fraud.
Jabbar and Suharjito [4] employed unsupervised machine learning techniques which used Call Detail Records (CDR) to detect fraud calls. The variables used are number dialed, destination city, duration, caller number and fee of the data set, which are similar to the traditional variables used in the detection of email phishing.
Maseno [5] proposed a theoretical model that can be used to detect such attacks. The model aims to help the user to effectively and quickly identify if the caller is trying to extract information from them. The study conducted cross-sectional survey research on 20 respondents who were selected using random sampling. Data was collected through a structured questionnaire for mobile phone users. An interview guide was used for the key informants in Kenya. Content analysis was used to analyze qualitative data while quantitative data was analyzed using SPSS. The findings revealed that the major contributing factors in such attacks are information sensitivity, psychological factors, and technical factors. A model was developed to aid users in the detection of phishing phone call attacks based on these three main factors.
While the previous study provided a theoretical model, Zhao et al. [6] have created an Android application to detect telecommunication frauds. When a call is answered, the application can actively analyze the contents of the call so that frauds can be identified. In particular, they collected descriptions of such scams from news reports and social media. Then they used machine learning algorithms to scan this data and to choose high-quality descriptions to form data sets. After this, they leveraged natural language processing to perform feature extraction from the textual data. Then they created rules for a model that identifies similar content for further fraud detection. The content used for matching was dependent on news reports from social media and not from actual calls.
Hollmén et al. [7] used a hierarchical regime-switching model to detect call-based frauds. This research was performed in 1998 and the model is outdated now.
Kedem et al. [8] have targeted vishing attacks using a new approach wherein the attacker provides step-by-step instructions to the victim over the phone which tell the victim to log in to his account and perform a banking transaction. Their proposed system monitors the gestures performed via input units, transactions, timing and speed of data entry, online operations, user engagement and user interactions with user interface elements. It ascertains that the victim is operating under dictated instructions by detecting the data entry rhythm and many other typical behaviors exhibited while performing an online banking transaction while also speaking on the phone.
Peng and Lin [9] found that most existing fraud detection models mark the phone numbers of fraudsters and warn the users according to the marked results. They proposed a scam call analysis method that is based on the label propagation community detection algorithm (LPA). The call content is converted into a complex network. Then the LPA algorithm is used to create fraud communities on that complex network.
Marzuoli et al. [10] performed a large-scale data-driven analysis of the telephony spam and fraud ecosystem. They uniquely identified bad actors potentially operating several phone numbers. The data set was collected from a website called "honeypot". It contains around 8,000,000 calls that were received in 2015. Out of these, 880,000 were from distinct sources and 80,000 had distinct destinations. They collected about 40,000 such call recordings. They then demonstrated that only a few bad actors are responsible for the majority of telephony spam and fraud and that they can be uniquely identified by their audio signature. They studied the semantic information obtained from call recordings using NLP and clustering algorithms. Then the audio features of the call were extracted from each cluster and analyzed to detect whether the call is fraudulent or not. So their work was majorly based on identifying the existing fraudsters but would be unable to work on new fraudsters.
Choi et al. [11] have assessed the modus operandi of voice phishing using crime script analysis. The results of their study showed that the preparation for this kind of voice phishing includes readying for the crime, recruiting telemarketers and creating the scripts. The next step involves randomly making international and voice-over-internet calls to a vast number of people, which constitutes its major activity. The post-activity involves the withdrawing and wiring of the amount of money deposited by victims to the perpetrators.
Saini [12] explains how the bad guys use social engineering techniques to steal the personal and sensitive information of a user. The modus operandi of voice phishing that is used by these fraudsters nowadays is also explained. Based on the survey, some case studies and examples are mentioned along with the protective measures that a user can take to protect their personal information.
Zhang et al. [13] have proposed CANTINA, a novel content-based approach using the TF-IDF information retrieval algorithm. They have implemented and discussed the evaluation of several heuristics to reduce false positives. This method was however limited to websites.
Tu et al. [14] performed a study to work out the methodology, design, execution, results, analysis and evaluation for why vishing attacks work and what countermeasures to take against them. The study was performed using 10 telephone phishing call experiments on 3,000 of their university participants including staff and faculty without prior awareness. The results were analyzed by performing linear regression and statistical hypothesis testing methods through which they identified that spoofed Caller ID had a significant impact on tricking the victims into revealing their Social Security number.


Maseno et al. [15] worked on a study aimed at finding out the pivot factors of vishing attacks. Their study was cross-sectional survey research. The sample space of respondents
was selected using random sampling and their data was
collected using a structured questionnaire for mobile phone
users. The study revealed that the pivot factors for such attacks
are psychological, technical and information sensitivity based.
Based on these factors, mitigation measures were proposed.
Aleroud and Zhou [16] created a multidimensional phishing
taxonomy based on a comprehensive survey of the related
literature. The taxonomy provides an integrated view of phish-
ing that consists of four dimensions: communication media,
target environments, attacking techniques, and countermea-
sures. Their phishing countermeasures provide a classification
consisting of five categories: Human Users, Profile Matching,
Machine Learning, Text Mining, and the last category consists
of ontology, search engines, honeypot countermeasures and
client-server authentication.
Alabdan [17] performed a comprehensive analysis of the characteristics of the existing classic and modern phishing attack techniques. The survey explains the various characteristics of the different approaches and types of phishing techniques, and may serve as a base for developing a more holistic anti-phishing system.
We observed that most of the current work is focused on
theoretical models or fraud mitigation techniques. The few
automated systems developed rely on call-related features like
blacklisted phone numbers, the fraudster’s location, caller id,
and voice to detect the fraud calls. The system we proposed
will use the content of the call rather than other external
features.
III. PROPOSED METHODOLOGY
In this section, we describe our system, its key features and the procedure we employed to create the system. From our study of the existing technologies and related work, we identified some limitations as mentioned in the Literature Survey. Then we came up with a feasible solution that is better equipped at solving this problem. Our system will classify a fraud call based on the transcript of the call conversation. We strive to analyze the intent of the caller based on the content of the calls. Our algorithm is built using call transcripts in the English language.
The steps to create the model are shown diagrammatically in Fig. 1 and are described as follows:

Fig. 1. Flow of Methodology

A. Assembling Data
We searched for phone call transcripts data sets. However such data sets are not abundantly available to the public due to privacy concerns. So we also searched for conversational data sets and found various data sets which with a little processing could suit our needs. These would be the part of the data set which are not fraud calls. For the fraud phone calls, we passed a survey amongst people and based on the responses created the fraud calls transcripts part of our data set. We also used data collected through various sources having reports or testimonies of the victims of such crimes. Then we merged these two parts and formatted it to create our data set. The data set has a total of 2775 transcripts.

B. Data Analysis and Visualization
We plotted the data as shown in Fig. 2 and Fig. 3 and performed data analysis. The data was found to be skewed. This is because fraud calls form a very small percentage of the total calls received by someone. When a data set does not represent all classes of data equally, the model might overfit to the class that’s represented more in your data set. It might become oblivious to the minority class. Thus it might even give a good accuracy but might fail miserably in real life. In our project, a model that keeps predicting that call is not a fraud call every single time will also have a good accuracy as the occurrence of fraud call itself will be rare among the inputs. But it will fail when an actual case of fraud is subjected to classification, therefore failing its original purpose. Hence, we had to balance the data set.
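The point about accuracy on a skewed data set can be made concrete with a small sketch. The class counts below (5 fraud calls among 100) are made up for illustration and are not the paper's data; `accuracy` and `fraud_recall` are hypothetical helper names. A "model" that labels every call as normal scores high accuracy while catching no fraud at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def fraud_recall(y_true, y_pred):
    """Fraction of actual fraud calls (label 1) that were predicted as fraud."""
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual = sum(1 for t in y_true if t == 1)
    return caught / actual if actual else 0.0


# Assumed toy labels: 1 = fraud call, 0 = normal call.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always answers "normal".
y_pred = [0] * 100

print(f"accuracy     = {accuracy(y_true, y_pred):.2f}")      # accuracy     = 0.95
print(f"fraud recall = {fraud_recall(y_true, y_pred):.2f}")  # fraud recall = 0.00
```

This is why accuracy alone is not a sufficient measure here, and why the minority class has to be given more weight during training.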


Fig. 2. Line Plot depicting the data set

Fig. 3. Box Plot showing the skewness of the data set

C. Preprocessing Data
Text preprocessing means cleaning the text to make it understandable to the computer. Our application removes the information from the call transcripts that is not useful for the model. This is achieved in the following ways:
• Removing Punctuation and Stopwords: We first converted all the words to lowercase letters. Then we removed the punctuation. Stopwords are the words (data) that are not useful for the model, e.g., ‘for’, ‘the’, ‘as’, etc. So using the NLTK library in Python we filtered out all the stopwords.
• Digits to words: The call transcripts contain digits, e.g., Rs 25,000. These digits are converted into words which can then be converted easily into word vectors for the algorithm.
• Stemming: Stemming in Natural Language Processing means removing the affixes of the word and representing it by its stem word. The stem of the words ‘sleeping’, ‘sleeps’, ‘sleeper’ is ‘sleep’. This step helped in normalizing the corpus.
• Lemmatization: Lemmatization is the same as stemming but instead of stem word, the word is replaced by its root word. Stemming sometimes creates new words that may not have any meaning. So to resolve the problem we performed stemming first followed by lemmatization.

D. Labelling and Splitting the Data
After preprocessing the data, we labeled the data with unique values 1/positive and 0/negative. 1 represents that the call transcript is a phishing call, i.e. positive, and 0 represents that it is a normal call transcript. Then we split the data set into training and testing data sets.

E. Over Sampling
Johnson and Khoshgoftaar [18] carried out a survey that provides the most comprehensive analysis of deep learning methods for addressing the class imbalance data problem. A widely used technique for dealing with unbalanced data sets is called resampling. It is done after the data is split into training and test sets. Resampling is done only on the training set, otherwise the performance measures could get skewed. Resampling can be of two types: Over-sampling and Under-sampling. We used oversampling to balance our data set. Oversampling in simplified terms is duplicating random records from the minority class to make it equal to the majority class.

F. Building and Training the Models
We created a Naive Bayes and a CNN model. Then we trained these models on the training data set.
1) CNN Model: A convolutional neural network, which is a class of deep neural networks, is a Deep Learning Algorithm that is mainly used in image classification [19]. However, it has its uses in text classification and sentiment analysis too. Kumar and Zymbler [20] used CNN to analyze customer satisfaction from airline tweets. A CNN has hidden layers which are known as convolutional layers that are added one after another. These layers consist of multiple filters that help in detecting specific features. CNN for text classification mainly uses three such layers that are: embedding layer, convolutional one-dimensional layer and global max-pooling layer. Each of these layers and the steps involved in the CNN algorithm are explained in the following sections:
• Embedding Layer: Machine learning algorithms require numeric data. There are various encoding techniques such as Bag of Words (BOW), TF-IDF and Word2Vec that encode the given corpus in a numeric form. This process of converting each word into its vector is called “Word Embedding”. The advantage of word embedding is that it collects more information in fewer dimensions. It maps the semantic meaning of the word in a geometric space called an embedding space. Our application uses one
of the most efficient techniques for word embedding, “Word2Vec”. Word2Vec was developed by Google, and pre-trained Word2Vec embeddings are available. To train our word embeddings, we used the Gensim Python package, which implements Word2Vec. Gensim expects its input as a sequence of sentences. It trains the word vectors and stores them in a KeyedVectors instance. Gensim has several pre-trained models. Once the word vectors are trained they are stored in a format that is compatible with the Word2Vec implementation.
• Convolutional Layer: Before implementing the CNN model, we first added padding to the sequences to make each sentence of the same length. This is achieved by finding the length of the longest sequence. After padding the sequences, we implemented the Convolutional 1-D layer using the Keras library in Python. This layer is in between the Embedding layer and the GlobalMaxPooling1D layer. This layer has several parameters and the important ones are kernel size, filter size and activation type. Typically, in word embedding, each sentence is represented in a matrix form. The rows of the matrix represent the tokens in the sentence and the columns represent the vectorized words. This matrix is convolved with different filter sizes in the Convolutional 1-D layer. We used the filter sizes of [2,3,4,5,6]. The kernel size in CNN represents the sequence of words it will convolve at a given time. So, during the convolution process the sequence of words according to the kernel size are taken into consideration and are multiplied by the filter size. These multiplication results are then summed together and then fed to an activation function. The activation function that we used is the Rectified Linear Unit (relu). This function gives the feature value and the mathematical formula used is as shown in (1):

$$c_i = f(w \ast x_{i:i+m-1} + b) \tag{1}$$

Here, c = convolutional process, w = word matrix, x = element wise multiplication operation, b = bias term from that row. Once the convolution process is completed for one filter, all the features obtained by the relu function are mapped to the feature map as [c1, c2, c3, ..., c(m-1)].
• Global Max Pooling 1-D: We then applied the GlobalMaxPooling1D layer on the convolution layer to get the maximum value of the features in a pool for each feature dimension. When all the filters are applied to the convolutional layer, a list of feature values is made using this max-pooling feature.
The final step in CNN is to form a full connection layer which includes the dropout and regularization from the final feature vector to the output layer. We then summarised our model on the training set by displaying the type of layer used, the Output Shape of each layer and the connection between layers.
2) Naive Bayes: The Naive Bayes classifier is a classification algorithm based on the Bayes Theorem. The main principle of this algorithm is that every pair of features that are being classified are independent of each other.
In this model, we first stored the positive and negative, i.e. fraud and non-fraud, call transcripts present in the training data set and tokenized each word. The tokens of positive and negative classes are stored in different dictionaries. Both the dictionaries are then combined and then passed to the model.

IV. IMPLEMENTATION & RESULTS
To implement the proposed methodology, the system should have a stable internet connection and the required data set. For the application to run successfully, a system with a minimum of 4 GB and a maximum of 8 GB of RAM is required.
After implementing our methodology, we obtained an accuracy of 95.47% for the Naive Bayes model and 97.21% for the CNN model respectively.
As shown in the confusion matrix in Fig. 4, where the rows represent the actual labels and columns represent the predicted labels, for the Naive Bayes algorithm, most of the calls were classified correctly. Also, the number of normal calls is more than the number of fraud calls even while testing for a small subset of calls. Furthermore, none of the normal calls were classified as fraud calls. A small number of fraud calls were classified as normal calls.

Fig. 4. Confusion Matrix of the Naive Bayes Model

Since the data set is highly imbalanced, we cannot rely on the model’s accuracy only. So to check the model’s performance we plotted graphs displaying the precision, recall and F1 score of both models.
Precision, also known as positive predictive value, tells us the fraction of positive class predictions that truly belong to the positive class. From Fig. 5 and Fig. 6 we observed that for both the algorithms our precision is high, which means that the model does not give out many false positives. On the other hand, recall, also known as the true positive rate, tells us what fraction of the actual positives the model correctly identifies. The recall for the models in the case of the positive class is low, which implies that there are quite a few instances of the positive class, i.e. a fraud call, being predicted as negative, i.e. a normal call.
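The evaluation measures discussed above can be sketched as follows. This is a minimal illustration with made-up labels rather than the paper's test set; `confusion_counts` and `precision_recall_f1` are hypothetical helper names. The toy predictions are chosen to mirror the pattern described in the text: no normal call is flagged as fraud, but some fraud calls are missed.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Assumed toy labels: 1 = fraud call, 0 = normal call.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)  # 2 0 2 6
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.50 f1=0.67
```

With zero false positives the precision is perfect, yet half of the fraud calls go undetected, which is exactly the case that recall exposes and accuracy hides.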


Fig. 5. Results of the CNN Model

Fig. 6. Results of the Naive Bayes Model

V. CONCLUSION
After implementing two different algorithms we determined that the CNN model gives an accuracy of 97.21% and the Naive Bayes model gives an accuracy of 95.47%. The recall, i.e. the proportion of fraud calls that are classified correctly, which is an important factor for our problem statement, is comparatively higher in the case of CNN than Naive Bayes. Hence we can conclude that the performance of the CNN model is better and that it is well equipped to classify the calls.
There are a few limitations to this project. Newer algorithms provide scope to improve the model performance. An interface is needed to implement this model. The ways and methods of duping people are always evolving and hence the data will need to be updated periodically.
Phishing through phone calls is a modern way to attack people and seek their personal information. It could be used to trick people from rural areas or elderly people since they may be unaware of such heinous frauds. Our models will help further research which will aid users to avoid such phishing call attacks. In this way, our work will help protect people from being tricked through fraudulent phone calls and thus will help to reduce phishing cases.

REFERENCES
[1] Sujay Radhakrishna Vikhepatil, Shrikant Eknath Shinde, Hemant Patil, Unmesh Bhaiyyasaheb Patil, Sambhajirao Mane Dhairyasheel, Fraudulent Usage of Credit/Debit Card, http://loksabhaph.nic.in/Questions/QResult15.aspx?qref=15384&lsno=17
[2] Sumedhanand Saraswati, Online Frauds and Scams, http://loksabhaph.nic.in/Questions/QResult15.aspx?qref=17288&lsno=17
[3] Ş. Şentürk, E. Yerli and İ. Soğukpınar, "Email phishing detection and prevention by using data mining techniques", 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 2017
[4] M.A. Jabbar, Suharjito, "Fraud Detection Call Detail Record Using Machine Learning in Telecommunications Company", Advances in Science, Technology and Engineering Systems Journal, vol. 5, no. 4, pp. 63-69 (2020)
[5] Elijah M. Maseno, "Vishing Attack Detection Model For Mobile Users", KCA University, 2017
[6] Zhao, Q., Chen, K., Li, T. et al. "Detecting telecommunication fraud by understanding the contents of a call", Cybersecur 1, 8 (2018)
[7] Hollmén, Jaakko & Tresp, Volker, "Call-Based Fraud Detection in Mobile Communication Networks Using a Hierarchical Regime-Switching Model", 889-895
[8] Oren Kedem, Avi Turgeman, Itai Novick, Alexander Basil Zaloum, Leonid Karabchevsky, Shira Mintz, Ron Uriel Maor, "Device, System, and Method of Detecting Vishing Attacks", U. S. Patent 16/188,312, May 23, 2019
[9] L. Peng and R. Lin, "Fraud Phone Calls Analysis Based on Label Propagation Community Detection Algorithm," 2018 IEEE World Congress on Services (SERVICES), 2018
[10] A. Marzuoli, H. A. Kingravi, D. Dewey and R. Pienta, "Uncovering the Landscape of Fraud and Spam in the Telephony Channel," 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016
[11] Choi, Kwan & Lee, Ju-lak & Chun, Yong-tae, "Voice phishing fraud and its modus operandi", Security Journal, 2017
[12] Ujjwal Saini, "Voice Phishing Attacks", International Research Journal of Engineering and Technology (IRJET), July 2020
[13] Yue Zhang, Jason I. Hong, and Lorrie F. Cranor, "Cantina: a content-based approach to detecting phishing web sites", In Proceedings of the 16th international conference on World Wide Web (WWW '07). Association for Computing Machinery, New York, NY, USA, 2007, 639-648
[14] Tu, H., Doupé, A., Zhao, Z., & Ahn, G. J, "Users really do answer telephone scams", In Proceedings of the 28th USENIX Security Symposium (pp. 1327-1340). USENIX Association, 2019
[15] Elijah M. Maseno, Patrick Ogao, Samwel Matende, "Vishing Attacks on Mobile Platform in Nairobi County Kenya", International Journal of Advanced Research in Computer Science & Technology (IJARCST 2017)
[16] Ahmed Aleroud, Lina Zhou, "Phishing environments, techniques, and countermeasures: A survey", Computers & Security, Volume 68, 2017, ISSN 0167-4048
[17] Alabdan, Rana, "Phishing Attacks Survey: Types, Vectors, and Technical Approaches", Future Internet, 12, (2020)
[18] Johnson, J.M., Khoshgoftaar, T.M. "Survey on deep learning with class imbalance", J Big Data 6, 27 (2019)
[19] Xin, M., Wang, Y. "Research on image classification model based on deep convolution neural network", J Image Video Proc. 2019, 40 (2019)
[20] Kumar, S., Zymbler, M. "A machine learning approach to analyze customer satisfaction from airline tweets", J Big Data 6, 62 (2019)
