Received: 2 August 2020 Revised: 21 April 2021 Accepted: 25 May 2021

DOI: 10.1002/cpe.6452

RESEARCH ARTICLE

SM-Detector: A security model based on BERT to detect SMiShing messages in mobile environments

Abdallah Ghourabi1,2

1 Department of Computer Science, Jouf University, Tabarjal, Saudi Arabia
2 Higher School of Sciences and Technology of Hammam Sousse, University of Sousse, Sousse, Tunisia

Correspondence
Abdallah Ghourabi, Department of Computer Science, Jouf University, Tabarjal, Saudi Arabia.
Email: [email protected]

Summary

The growing use of SMS by businesses to communicate with their customers has made attackers more interested in smishing attacks. Smishing is a
security attack that involves sending a fake SMS message in order to steal the personal credentials of mobile users. This kind of attack has become
a serious cyber-security issue and has caused great financial losses for both people and businesses. In this article we propose a hybrid security model
called “SM-Detector” aiming to detect smishing messages in mobile environments. To increase the efficiency of “SM-Detector,” we have combined
three different detection methods: (i) identification of malicious URLs, (ii) identification of suspected words, phone numbers and emails with regular
expression analysis, and (iii) classification of messages using BERT-based algorithms to distinguish spam messages. “SM-Detector” also includes a
mobile application allowing the user to check their SMS and report smishing messages. Its strength is that it can deal with mixed text messages
written in Arabic or English. The experimental evaluation conducted on English and Arabic datasets showed a remarkable accuracy of 99.63%.

KEYWORDS
BERT, mobile security, SMiShing detection, SMS classification

1 INTRODUCTION

Since December 3, 1992, the date when Neil Papworth sent the first text message in the world, SMS has become an increasingly popular means
of communication and marketing.1 Nowadays, trillions of text messages connect billions of people around the world. This popularity is mainly due
to the large number of people using mobile phones. According to statistics produced by the GSMA company,2 mobile users will go from 5.2 billion
users in 2019 to 5.8 billion in 2025 (which corresponds to 70% of the world population). GSMA statistics also report that in 2019 3.8 billion users
(49% of the world population) use mobile devices to connect to the Internet. In another report,3 published statistics showed why marketers are
so interested in mobile phone users. This report indicates that in 2019: 90% of people in the United States prefer to receive texts from businesses
over calls; 87% of internet users are on mobile phones; 43% of product research is done on mobile phones; 40% of total transactions are on mobile
phones. Also, several businesses consider that text messages are more effective than emails. This is because 82% of SMS are read within 5 min, but
consumers only open one in four emails they receive.4
With the exponential growth of SMS, malicious operations, such as SMiShing and spam, are also increasing. These actions are not only annoying
for the users, but they also cause financial damage for both individuals and organizations.5 Smishing, or SMS phishing, is a scam method similar to
phishing email, which operates via the SMS mobile phone messaging service. SMS messages are sent to owners of smartphones with the aim of
stealing personal or banking data. For example, a smishing SMS announces to the recipient that he has won a gift or a sum of money, or that he needs
to fix a problem that arose with his bank card or electronic account. Usually, the smishing message invites the user to click a link, call a phone number,
or contact an email address provided by the attacker. The victim is then invited to provide private data like credit card number, bank account details,

Concurrency Computat Pract Exper. 2021;e6452. wileyonlinelibrary.com/journal/cpe © 2021 John Wiley & Sons, Ltd. 1 of 15
https://doi.org/10.1002/cpe.6452

FIGURE 1 An example of an Arabic phishing message

credentials of web accounts, and so on. In social media, the messaging and chat services are not secure either. Scammers use these channels through
social networks like Facebook, Instagram, WhatsApp, and more to execute their phishing attacks.
In Arabic-speaking countries, the number of attacks by phishing messages has increased rapidly in recent years. This type of attack mainly
targets the banking sector and causes important financial losses every year. According to the newspaper Arab News, approximately 1063 cases of
financial fraud worth a total of around $13 million were reported in Saudi Arabia in 2019. Hackers usually lure their victims with messages asking
bank customers to update their bank details, or with messages claiming that bank cards have been blocked and attaching contact numbers or links
for customers to click on. For example, Figure 1 shows an Arabic phishing message attempting to steal the confidential information of the customer’s
bank card. Most of these messages are written in Arabic which makes them difficult to detect by traditional security systems that are trained on
English messages. We therefore address this issue in this article.
The literature survey reveals that there are a few approaches proposed by previous researchers to detect and filter smishing messages. While
studying the existing work, we found that most of them used the blacklist method, the whitelist method, heuristic techniques, or even machine
learning classification. Each of these methods has its advantages and disadvantages, and none of them have proven effective when employed alone.
In this article we propose a hybrid solution combining these different methods to detect smishing messages. The objective is to benefit from the
advantages of each method. Blacklist and heuristic techniques are very powerful in the detection of spam containing well-known features. Machine
learning based techniques are interesting in detecting messages whose malicious content varies regularly.
In this article, we want to focus on the SMS messages that are delivered in the Arabic speaking countries. These messages are generally written in
Arabic or English. The approach we propose is a security model called “SM-Detector” for detecting smishing messages in a mixed mobile environment
that supports Arabic and English. This model is based on the combination of three methods: (i) identification of malicious URLs, (ii) identification of
suspected words and expressions, and (iii) classification of messages based on the BERT pretrained model.
Recently, the emergence of BERT and pretrained models in general has brought natural language processing (NLP) to a new era. The release of
BERT by Google in 2018 is considered one of the most important milestones in the evolution of NLP. It allowed researchers to achieve state-of-the-art
results and provide high performance models in multiple NLP tasks such as Question Answering,6-10 Sentiment Analysis,11-15 Text Classification,16-20
and Machine Translation.21-23 Given the recent success of this pretrained model, we have chosen to use it in our approach as a basis for representing
and classifying text messages.
The main contributions of our article are summarized in the following items:

• A hybrid security model capable of detecting smishing messages in a mixed mobile environment that supports Arabic and English.
• A module for inspecting URLs contained in text messages using VirusTotal.
• Identification of suspicious words, expressions, phone numbers, and emails using regular expression analysis.
• A BERT-based classification to distinguish spam messages from legitimate ones.
• A mobile application that helps users to check whether received messages contain smishing operations.

The remaining parts of the article are organized as follows: Section 2 reviews the related works. Section 3 describes the architecture of our
security model. Section 4 explains in detail the operating principle of the BERT-based classifier. Section 5 presents the experimental results. Finally,
we conclude the article in Section 6.

2 RELATED WORK

With the continuous increase in phishing attacks carried out through short messages on mobile phones, several researchers have focused in recent
years on the study of this phenomenon. A large part of the conducted research has used machine learning and data mining techniques to
analyze the messages and identify those which represent a threat. Some researchers have proposed the use of the Naive Bayes algorithm to classify
text messages as spam or ham.24-26 Other researchers have tried other classifiers such as Random Forest, Decision Tree, Support Vector Machine,
and AdaBoost5,27 or even a rule-based classification.28 Others have been interested in standardizing and expanding the content of the
messages to improve the classifiers' performance.29 Recently, researchers began using deep learning techniques for this task.30,31

In Reference 24, the authors proposed a model called “Smishing Detector” whose objective is to detect smishing messages with the aim of
reducing false positives. The proposed model contains four modules. The first module analyzes the content of text messages and
identifies malicious content and keywords using the Naive Bayes classification algorithm. The second module inspects the URL
present in the messages. The third one analyzes the source code of the website linked in the messages. The last module is an APK
download detector, whose role is to identify whether a malicious file is downloaded when the URL is called. The experimental tests carried out by the
authors on this model gave an accuracy of 96.29%.
In a recent article,30 Roy et al. proposed the use of deep learning to classify Spam and Not-Spam SMS messages. Their approach consists of
employing two deep learning techniques together: Convolutional Neural Network and Long Short-Term Memory. The goal was to classify text
messages and identify those that are spam and those that are not spam. To evaluate the performance of their approach they compared it with
other machine learning algorithms like Naive Bayes, Random Forest, Gradient Boosting, Logistic Regression, and Stochastic Gradient Descent. The
obtained results showed that CNN and LSTM models were much better than the other machine learning models.
In another paper intended for the detection of smishing messages, Joo et al.25 proposed a model called “S-Detector.” This model contains four
modules: a component for monitoring the SMS activities, an analyzer to analyze the content of the SMS, a determinant to classify and block smishing
text messages, and a database to store the SMS data. In this model, the authors used Naive Bayes classification algorithm to analyze the contents of
the messages.
In another paper,28 Jain and Gupta proposed a rule-based classification technique in order to detect phishing SMS. Their approach has identified
nine rules that can filter phishing SMS from legitimate ones. The experimental tests conducted by the authors gave a true negative rate of 99% and
a true positive rate of 92%.
In Reference 5, Sonowal et al. proposed a model called “SmiDCA” for detecting smishing messages based on machine learning algorithms. In this
model, the authors have chosen to use correlation algorithms to extract 39 most relevant features from smishing messages. Then, they evaluated
the performance of their model by applying four machine learning classifiers: random forest, Decision Tree, Support Vector Machine and AdaBoost.
The experimental evaluation of this model gave an accuracy of 96.4% with Random Forest Classifier.
In Reference 27, the authors proposed a feature-based approach for detecting smishing messages. This approach extracts ten features that,
according to the authors, can distinguish fake messages from ham. The features were then applied to a benchmark dataset using five
classification algorithms to judge the performance of the proposed approach. The experimental evaluation showed that the model can detect
smishing messages with a true positive rate of 94.20% and an overall accuracy of 98.74%.
In Reference 29, the authors proposed a processing method to normalize and expand text messages in order to improve the performance of
classification algorithms when applied to these text messages. The proposed method is based on lexicography, semantic dictionaries and techniques
of semantic analysis and disambiguation. The main idea was to standardize the words and create new attributes in order to expand the original text
and reduce factors that could degrade performance, such as redundancies and inconsistencies.
In Reference 26, Arifin et al. proposed an SMS spam filtering method based on two data mining techniques: FP-growth and Naive Bayes.
FP-growth algorithm is used to extract frequent itemset in text messages and Naive Bayes Classifier is used to classify the messages and filter out
those that are spam. The experimental evaluation conducted by the authors on this approach showed an average accuracy of 98.5%.
In Table 1, we present a comparative summary on different works discussed in this section.
The contribution of our article, compared with existing work, is the proposal of an efficient system for detecting smishing messages which can
deal with both English and Arabic messages. To the best of our knowledge, the model that we propose in this article is the first approach which uses
a combination of heuristic methods and the pretrained BERT model to classify short messages.

3 THE PROPOSED MODEL

In this section, we describe in detail our model “SM-Detector.” The main idea of the proposed model is to process the collected SMSs and apply some
analysis methods to classify them and identify those that are considered spam or phishing messages. In Figure 2, we present the architecture of
“SM-Detector.” In this model, we have chosen to apply three detection methods: (i) identification of malicious URLs, (ii) identification of suspected
words and expressions, and (iii) classification of messages based on BERT. The objective of this variety of methods is to benefit from the advantages of
signature-based detection techniques (through the use of the VirusTotal API to identify malicious URLs and the preparation of a blacklist containing
suspicious expressions, phone numbers, and emails); and also take advantage of machine learning algorithms which are very powerful in identifying
suspicious content which is not recognized by signature-based detection techniques.
The process of the proposed system begins by cleaning up unnecessary information found in the text messages and extracting URLs if found.
Then, three detection methods are applied on these messages:

• Inspection of URLs found in the messages using the VirusTotal API.



TABLE 1 Comparative summary of related work

Paper reference | Approach objective | Used methods | Dataset type
Roy et al.30 | Classify SMS and identify spam messages | CNN and LSTM | UCI machine learning repository: SMS spam collection data set32
Mishra et al.24 | Identify spam messages and inspect included URL | Naive Bayes | UCI machine learning repository: SMS spam collection data set32; Pinterest smishing message images
Joo et al.25 | Classify SMS and identify spam messages | Naive Bayes | Private dataset
Jain et al.28 | Classify SMS and identify spam messages | Rule-based classification | UCI machine learning repository: SMS spam collection data set32
Sonowal et al.5 | Classify SMS and identify spam messages | Random Forest, Decision Tree, Support Vector Machine, and AdaBoost | UCI machine learning repository: SMS spam collection data set32; non-English data from Yadav et al.33
Jain et al.27 | Identify spam messages and inspect included URL | Feature-based technique; Random Forest, Naïve Bayes, Support Vector Machine, and Neural Network | UCI machine learning repository: SMS spam collection data set32; NUS SMS corpus34; Pinterest smishing message images
Almeida et al.29 | Normalize and expand text messages to improve the classification performance | Lexicography, semantic dictionaries, and techniques of semantic analysis and disambiguation | UCI machine learning repository: SMS spam collection data set32
Arifin et al.26 | Classify SMS and identify spam messages | FP-growth and Naive Bayes | UCI machine learning repository: SMS spam collection data set32
Ghourabi et al.31 | Classify SMS and identify spam messages | CNN and LSTM | UCI machine learning repository: SMS spam collection data set32; private dataset

FIGURE 2 Architecture of SM-Detector



• Analysis of the content of the message with “regular expression” technique to check whether it contains blacklisted words or numbers.
• Classification of the text messages with the BERT-based classifier to distinguish spam messages and those which are not spam.

The result of each analysis method mentioned above makes it possible to determine whether the message is considered spam or not. The
correlation module gathers all these results to make a final decision regarding this message. A single method declaring the message suspicious
is enough for the entire system to mark it as spam.

3.1 Data cleaning

Data cleaning is the first step in our detection model. The goal of this task is to remove unnecessary words and items from the text messages (such
as punctuation marks and useless symbols) in order to improve the performance of the machine learning model.
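
The cleaning step can be illustrated with a minimal Python sketch. The article does not list the exact cleaning rules, so the URL pattern, the set of punctuation marks (ASCII plus a few common Arabic marks), and the choice to strip URLs from the text before classification are all assumptions made for illustration; `extract_urls` and `clean_message` are hypothetical helper names.

```python
import re
import string

# Assumed URL pattern; the article does not say how URLs are recognized.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# ASCII punctuation plus the Arabic comma, semicolon, and question mark (assumed set).
PUNCTUATION = string.punctuation + "،؛؟"


def extract_urls(message):
    """Return the URLs found in a message, in order of appearance."""
    return URL_PATTERN.findall(message)


def clean_message(message):
    """Drop URLs and punctuation, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", message)
    text = text.translate(str.maketrans(PUNCTUATION, " " * len(PUNCTUATION)))
    return " ".join(text.split())
```

In this sketch the extracted URLs would be passed to the URL inspection step, while the cleaned text would be passed to the classifier.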

3.2 URL inspection with VirusTotal API

VirusTotal is a webservice that analyzes files and URLs for viruses, worms, trojans, and other kinds of malicious content.35 It is a data aggregator that
obtains results from the combined output of different antivirus products like Kaspersky, Symantec, BitDefender, and so on. VirusTotal uses multiple
antivirus engines to scan files or URLs at the same time, and then provides the scan reports to its users.
In our system, we have integrated an API provided by VirusTotal to obtain the scan results. The VirusTotal API makes it possible to build simple scripts
to access the information generated by VirusTotal without the need of using the website interface.35 Our idea is to extract any URL or downloadable
file found in the text messages and send it to VirusTotal through its API to analyze it. The scan result allows us to check whether the URL is malicious
or not.
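
As a rough illustration of such a lookup, the sketch below queries the URL report endpoint of the public VirusTotal REST API (version 3) using only the Python standard library. The article does not state which API version or client was used, so the endpoint choice, the helper names, and the `min_engines` decision threshold are assumptions; an API key is required and `fetch_url_report` performs a network request.

```python
import base64
import json
import urllib.request

VT_URL_ENDPOINT = "https://www.virustotal.com/api/v3/urls/"


def vt_url_id(url):
    """API v3 URL identifier: URL-safe base64 of the URL without '=' padding."""
    return base64.urlsafe_b64encode(url.encode()).decode().strip("=")


def fetch_url_report(url, api_key):
    """Fetch the latest analysis report for a URL (network call, needs a key)."""
    request = urllib.request.Request(
        VT_URL_ENDPOINT + vt_url_id(url),
        headers={"x-apikey": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def is_malicious(report, min_engines=1):
    """Flag the URL when at least min_engines engines call it malicious.

    The threshold is an assumption; the article does not specify one.
    """
    stats = report["data"]["attributes"]["last_analysis_stats"]
    return stats.get("malicious", 0) >= min_engines
```

A caller would run `is_malicious(fetch_url_report(url, api_key))` for each URL extracted from a message and forward the result to the correlation step.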

3.3 Regular expression analysis

A regular expression is a character string (or pattern), which describes, according to a precise syntax, a set of possible character strings. Such patterns
are used by string-searching algorithms to find, for a given text, the sentences that match the pattern description.
Regular expressions are very popular in spam filtering. The filtration technique called heuristics, or rule-based spam filtering, is a set of rules
represented in the form of regular expressions. It searches for messages whose content corresponds to very specific characteristics known
to have a high probability of being spam. The popularity of this technology is due to its simplicity, speed, and consistent accuracy. In addition, it is
better than other filtering technologies in the sense that it does not require a training phase.36 For example, in email filtering, a simple heuristic
filtering system may assign an email a score based on the number of patterns (or rules) it matches. If the score of an email is higher than a predefined
threshold, the email will be classified as spam.36
Using regular expressions to find variations of “sensitive” words increases the chances of discovering spam. For example, if a spammer tries to
defeat a keyword filter using the word “viiaaagraa” instead of “viagra,” the regular expression “/^vi+a+gra+$/i” (a “v” followed by one or more “i”,
followed by one or more “a,” followed by a “g,” an “r,” and one or more “a,” regardless of the case) still matches the word.
In our “SM-Detector” model, we used this technique to identify text messages containing suspicious expressions. We prepared a set of rules
based on regular expressions to detect sensitive words or phrases such as “viagra,” “winner,” “your ATM card is blocked,” and so on. In addition, we
created a blacklist of phone numbers and emails that have been reported in previous spam messages. These lists can be updated regularly thanks to
a functionality allowing the users of our system to report messages that they consider spam.
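
A minimal sketch of this kind of rule set in Python follows. The three expressions mirror the examples quoted above, but the full rule set and the blacklists are not given in the article, so the list contents, the phone and email patterns, and the function name `find_reasons` are placeholders chosen for illustration. Word boundaries (`\b`) replace the `^…$` anchors so that the pattern can match a word inside a longer message.

```python
import re

# Illustrative rules only; the real rule set is not listed in the article.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bvi+a+gra+\b", re.IGNORECASE),
    re.compile(r"\bwinner\b", re.IGNORECASE),
    re.compile(r"your\s+atm\s+card\s+is\s+blocked", re.IGNORECASE),
]
BLACKLISTED_NUMBERS = {"+10000000000"}      # placeholder entry
BLACKLISTED_EMAILS = {"scam@example.com"}   # placeholder entry

# Assumed formats for phone numbers and email addresses.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{6,}\d")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_reasons(message):
    """Return the reasons why a message looks suspicious (empty if none)."""
    reasons = []
    for pattern in SUSPICIOUS_PATTERNS:
        match = pattern.search(message)
        if match:
            reasons.append("suspicious expression: " + match.group(0))
    for number in PHONE_PATTERN.findall(message):
        normalized = re.sub(r"[\s-]", "", number)
        if normalized in BLACKLISTED_NUMBERS:
            reasons.append("blacklisted number: " + normalized)
    for email in EMAIL_PATTERN.findall(message):
        if email.lower() in BLACKLISTED_EMAILS:
            reasons.append("blacklisted email: " + email.lower())
    return reasons
```

Returning the matched reasons rather than a bare flag keeps the information that the mobile application needs to explain its verdict.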

3.4 BERT-based classification

BERT-based classification is the most important phase in our system. In this part, we created a machine learning model for detecting spams from a
collection of Arabic and English text messages. The implementation of this model is based on two components: a BERT model and a fully connected
neural network. The architecture and operating principle of this model are described in more detail in Section 4.

3.5 Correlation

Correlation is the last step in the process of detecting smishing messages. Its role is to collect the results of the three analysis methods described
above and to make a decision regarding the text message given as input. If at least one of the analysis methods detects the presence of malicious
content in the message, then the message will be classified as spam. The correlation module gathers the different causes of this classification (for
example, detection of a malicious URL or detection of suspicious expressions) and sends them to the mobile application from which the text message
was sent. The mobile application that is part of our “SM-Detector” system allows users to check the status of their text messages and see if they
contain spam. This application informs the user of the presence of a spam message and indicates the reasons that led the system to consider this
message suspicious.
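
The decision rule described in this section amounts to a logical OR over the three detectors. A minimal sketch, assuming each detector exposes its result in the simple form shown here (the function name, argument names, and return structure are illustrative, not taken from the article):

```python
def correlate(url_flagged, regex_reasons, bert_label):
    """Combine the three detectors; one positive result marks the message as spam.

    url_flagged   -- True if the URL inspection found a malicious URL
    regex_reasons -- list of reasons produced by the regular expression analysis
    bert_label    -- 1 if the BERT-based classifier predicts spam, otherwise 0
    """
    reasons = []
    if url_flagged:
        reasons.append("malicious URL")
    reasons.extend(regex_reasons)
    if bert_label == 1:
        reasons.append("classified as spam by the BERT-based classifier")
    return {"spam": bool(reasons), "reasons": reasons}
```

The returned reasons are what the mobile application would display to the user alongside the verdict.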

4 BERT-BASED CLASSIFICATION

In this section, we introduce our BERT-based model for message classification. We first identify and explain the problem to be solved through this
model. We then briefly describe the concept of BERT. Finally, we present the basic principle of our classification model and describe its architecture.

4.1 Problem definition

The problem of SMS classification addressed in this section is described as follows. Consider a dataset D composed of n short messages, each
associated with a label y indicating whether the message is spam or not. Let x be an input message consisting of k words, denoted as
(x_1, x_2, …, x_i, …, x_k), where x_i (1 ≤ i ≤ k) refers to the ith word in the message. The objective of our BERT-based classifier is to assign to each
input message x a label value y ∈ {0, 1}, where 0 means that the message is considered non-spam, and 1 means that it is considered spam.

4.2 BERT

BERT (Bidirectional Encoder Representations from Transformers)15 is a new language representation model released by Google in late 2018. The
strength of BERT is its ability to make deep bidirectional representations from text. This means that the model can learn information from left to right
and right to left. The BERT framework includes two steps: pretraining and fine-tuning. During pretraining, the model is trained on large data from
BooksCorpus (800M words) and English corpus from Wikipedia (2500M words).15 Then the pretrained model can be fine-tuned with supplementary
output layer to solve various NLP tasks like Text Classification, Sentiment Analysis, Question/Answering, Machine Translation, and so on.
BERT's model architecture is a multilayer bidirectional Transformer encoder. There are two model sizes: BERT Base and BERT Large. BERT
Base has 12 Transformer blocks, a size of 768 for the hidden layer, 12 self-attention heads, and 110M parameters, whereas BERT Large has 24
Transformer blocks, a size of 1024 for the hidden layer, 16 self-attention heads, and 340M parameters.
As the objective of the model proposed in this article is to classify SMS, we describe in this paragraph how to fine-tune the pretrained model for
text classification. In Figure 3, we present an illustration of fine-tuning BERT on single sentence classification. Each input sentence is represented
as a sequence of tokens. The first token of every sequence is always a special classification token noted ([CLS]). As shown in Figure 3, we denote

FIGURE 3 Illustration of fine-tuning BERT on single sentence classification15

the embedding of the input sequence as E, the final hidden vector of the special [CLS] token as C ∈ ℝ^H, and the final hidden vector for the ith input
token as T_i ∈ ℝ^H, where H is the size of the hidden layer.15 The final hidden vector is then useful to perform classification tasks by connecting it with
additional layers.

4.3 BERT-based classifier

The architecture and operating principle of our BERT-based classifier are presented in Figure 4 and Algorithm 1. The first step in this
process is to tokenize the input message and pad it to the maximum length. Then we add the CLS token to the sequence and offer
it as an input to the BERT model. Next, we extract the last hidden vector of the CLS token from the output of the BERT model.
Finally, we connect this vector to a fully connected neural network followed by a Softmax layer to obtain an output prediction between
0 and 1.

FIGURE 4 Architecture of the BERT-based classifier

Algorithm 1. BERT-FC

Input: x (a text message)


Output: y (message label: 0 or 1)

1: x1 = Tokenize(x)
2: x2 = Pad(x1 )
3: seq = Add_CLS(x2 )
4: last_hidden_vector = BERT(seq)
5: output = fully_connected(last_hidden_vector[CLS])
6: y = softmax(output)

4.3.1 Tokenize and padding

Let x be an input message consisting of k words, denoted as (x_1, x_2, …, x_k). To transmit our input message to BERT, we have to split it into tokens.
This operation is performed by the tokenizer included with the pretrained BERT model. The tokenization task consists in transforming each text into a
sequence of tokens, where each token is mapped to its index in the tokenizer vocabulary. Since the text messages do not have the same length, we
need to bring the sequences to a uniform length. The “padding” operation is thus used to pad the sentences shorter than “max sequence length” with
empty values and truncate the sentences longer than “max sequence length.” After that, the CLS token is inserted at the beginning of the sequence.
The obtained sequence can be represented as:

seq = [CLS, Tok1 , Tok2 , … , TokN ], (1)

where N is the maximum length of the sequence.
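
The padding and truncation step that produces Equation (1) can be sketched in a few lines of Python. The token identifiers below are placeholders rather than the output of the real BERT tokenizer, and the values chosen for `CLS_ID` and `PAD_ID` are assumptions for illustration.

```python
CLS_ID = 101  # assumed vocabulary index of the [CLS] token
PAD_ID = 0    # assumed vocabulary index used for padding


def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Bring a token sequence to exactly max_len entries."""
    clipped = token_ids[:max_len]                            # truncate long messages
    return clipped + [pad_id] * (max_len - len(clipped))     # pad short ones


def build_sequence(token_ids, max_len):
    """Return [CLS, Tok_1, ..., Tok_N] with N = max_len, as in Equation (1)."""
    return [CLS_ID] + pad_or_truncate(token_ids, max_len)
```

With this layout every sequence passed to the model has the same length, N + 1, regardless of the length of the original message.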

4.3.2 Pretrained BERT

In this step, we use a pretrained BERT model for the representation of input messages. The advantage of using a pretrained BERT model is that it
has already been trained with a combination of masked language modeling and next sentence prediction on a large corpus including the Toronto
Book Corpus and Wikipedia. In addition to the basic model that is pretrained on the English language, BERT offers a multilingual model pretrained
on a large corpus of multilingual data. This multilingual model is very useful in our case since the messages we deal with in this system contain
mixed words from the Arabic and English languages.
The output of this model is a sequence of hidden-states (HS ) represented as follows

HS = [C, T1 , T2 , … , TN ], (2)

where C ∈ ℝ^H is the final hidden state vector of the special [CLS] token, T_i ∈ ℝ^H is the final hidden state vector for the ith input token, and H is
the size of the hidden layer.

4.3.3 Fully connected

The goal of this operation is to apply a neural network classifier on top of the BERT model. This classifier is applied only on the final
hidden state vector C of the special [CLS] token (since it provides an aggregated representation of the sequence). The employed classifier
is a three-layer fully connected neural network with an input layer of size H (the hidden layer size of BERT) and an output layer of size 2 (the
number of labels).

4.3.4 Softmax

To obtain a label prediction from the output of the fully connected network, we use in this stage the Softmax function. This function maps the output
values to the range between 0 and 1 so that they sum to 1, which gives the probability of belonging to each class. The Softmax function takes as
input a vector z of K real numbers and provides a probability distribution according to the following formula:

\[ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \text{ and } z = (z_1, \dots, z_K) \in \mathbb{R}^K. \tag{3} \]
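
Equation (3) can be checked numerically with a short Python sketch. Subtracting the maximum before exponentiating is a common numerical-stability measure added here; it leaves the result unchanged and is not something the article describes.

```python
import math


def softmax(z):
    """Softmax of Equation (3) over a list of real numbers."""
    shift = max(z)                               # stability shift, cancels out
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]
```

For the two-label case used here (K = 2), the predicted label is the index of the larger of the two probabilities.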

5 EXPERIMENTAL EVALUATION

In this section, we present the results obtained from the experimental tests that we conducted on the model presented in this article.
The implementation of the BERT model and the classification algorithms was performed in Python 3.7 using the TensorFlow and PyTorch
environments.

5.1 Dataset description

For the evaluation of our approach, we used two datasets: (i) the SMS Spam dataset from the UCI Repository29 and (ii) a set of Arabic messages
collected from local smartphones. The SMS Spam dataset is a public set of labeled SMS messages that have been collected for SMS spam research. It
contains 5574 English messages labeled as legitimate (ham) or spam. The second dataset contains 2730 Arabic messages
collected from several local smartphones in Saudi Arabia and labeled as spam or not-spam. In Table 2, we show some statistics regarding
the datasets used.

5.2 Evaluation measures

To evaluate the performance of the proposed model, we used the standard metrics for classification tasks such as Accuracy, Precision, Recall,
F1-Score, Confusion Matrix, ROC curve, and MCC.

• The confusion matrix is a table which indicates the following measures:


− True Positives (TP): the cases when the actual class of the message was 1 (Spam) and the predicted is also 1 (Spam)
− True Negatives (TN): the cases when the actual class of the message was 0 (Not-Spam) and the predicted is also 0 (Not-Spam)
− False Positives (FP): the cases when the actual class of the message was 0 (Not-Spam) but the predicted is 1 (Spam).
− False Negatives (FN): the cases when the actual class of the message was 1 (Spam) but the predicted is 0 (Not-Spam).
• Accuracy: is the number of messages that were correctly predicted divided by the total number of predicted messages.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}. \tag{4} \]

• Precision: is the proportion of positive predictions (Spam) that are truly positives.

\[ \text{Precision} = \frac{TP}{TP + FP}. \tag{5} \]

• Recall: is the proportion of actual Positives that are correctly classified.

\[ \text{Recall} = \frac{TP}{TP + FN}. \tag{6} \]

• F1-Score: is the harmonic mean of precision and recall.

\[ \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{7} \]

• Receiver Operating Characteristic (ROC) is the plot of the true positive rate (TPR) (Equation 8) against the false positive rate (FPR) (Equation 9)
at various threshold settings.

\[ \text{TPR} = \frac{TP}{TP + FN}, \tag{8} \]
\[ \text{FPR} = \frac{FP}{FP + TN}. \tag{9} \]

• AUC (Area under the ROC Curve) is an aggregate measure of performance across all possible classification thresholds.
• MCC (Matthews correlation coefficient) is regarded as a balanced coefficient which takes into account the four confusion matrix measures
(true positives, false positives, true negatives, and false negatives):

\[
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \tag{10}
\]
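
As a minimal sketch, Equations 4 to 10 can be computed directly from the four confusion-matrix counts. The function name `classification_metrics` and the convention of returning zero for an empty denominator are assumptions made for illustration, not part of the original implementation. The example counts are not reported in the text; they are back-calculated here so that, for a 1661-message test split, they reproduce the BERT-FC row of Table 5.

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """Return the measures of Equations 4-10 from the confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: a ratio with an empty denominator is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)   # Equation 5
    recall = ratio(tp, tp + fn)      # Equation 6, identical to TPR (Equation 8)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),                 # Equation 4
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),       # Equation 7
        "fpr": ratio(fp, fp + tn),                                     # Equation 9
        "mcc": ratio(                                                  # Equation 10
            tp * tn - fp * fn,
            sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Back-calculated counts (an assumption, see the paragraph above).
bert_fc = classification_metrics(tp=152, fp=2, fn=7, tn=1500)
for name, value in bert_fc.items():
    print(f"{name}: {value:.6f}")
```

With these counts the function returns the Accuracy, Precision, Recall, F1-Score, and MCC values listed for BERT-FC in Table 5 to six decimals.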

TABLE 2 Statistics of the dataset

                        English  Arabic  Spam  Not-spam  Total
Number of messages      5574     2730    785   7519      8304



5.3 Parameters of the BERT-based classifier

The BERT model used in the proposed approach is a multilingual model pretrained on 104 languages, with 12 Transformer blocks, a hidden
layer size of 768, 12 self-attention heads, and 110M parameters. For the classification process, we applied the hyperparameters shown
in Table 3.
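
The hyperparameters of Table 3 can be gathered in a small configuration object, as in the hedged sketch below. The names `FineTuneConfig`, `total_optimizer_steps`, and `pad_or_truncate` are hypothetical, the padding id of 0 is an assumption, and the training-set size of 6643 messages assumes the 80/20 split of Section 5.4.2 with a 1661-message test part. The sketch also assumes one optimizer update per batch, which the text does not state.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class FineTuneConfig:
    """Values taken from Table 3; the class itself is illustrative."""
    max_seq_length: int = 64
    batch_size: int = 32
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    num_epochs: int = 3


def total_optimizer_steps(n_train, cfg):
    """Number of updates, assuming one update per batch and no gradient accumulation."""
    return ceil(n_train / cfg.batch_size) * cfg.num_epochs


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Clip a token-id sequence to max_len, then right-pad it with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


cfg = FineTuneConfig()
# 6643 training messages is an assumption (8304 messages minus a 1661-message test part).
print(total_optimizer_steps(6643, cfg))
print(len(pad_or_truncate([101, 2023, 102], cfg.max_seq_length)))
```

Under these assumptions each epoch has 208 batches, so three epochs give 624 updates, and every input is brought to exactly 64 token ids.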

5.4 Experimental results

To classify the text messages into Spam and Not-Spam, this article employs three analysis methods: (i) inspection for malicious URLs,
(ii) regular expression analysis, and (iii) BERT-based classification. We first present the results of the BERT-based classification alone,
without taking into account the first two detection methods. We then add the results of the first two methods (URL inspection and
regular expression analysis) to evaluate the overall performance of “SM-Detector.”

5.4.1 Results of the BERT-based classification

To evaluate our classification model, we compared it with traditional machine learning algorithms, namely (i) Support Vector Machine,
(ii) K-Nearest Neighbors, (iii) Multinomial Naive Bayes, (iv) Decision Tree, (v) Logistic Regression, (vi) Random Forest, (vii) AdaBoost,
(viii) Bagging classifier, and (ix) Extra Trees, and with two deep learning algorithms, CNN and LSTM. For all tested classifiers, we used
five-fold cross-validation. To compare the performance of the classifiers, we calculated six measures: Accuracy, Precision, Recall, F1-Score, AUC, and MCC.
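
The five-fold protocol can be sketched as follows. This is an illustration rather than the code used in the experiments: the function name `k_fold_indices`, the shuffling step, and the seed are assumptions, and the text does not say whether the folds were stratified by class.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption

    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves once as the test part; the other k-1 folds form the training part.
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# 8304 is the total number of messages in Table 2.
for fold_id, (train, test) in enumerate(k_fold_indices(8304, k=5), start=1):
    print(fold_id, len(train), len(test))
```

Each message appears in exactly one test fold, and the reported score for a classifier is then the average of the per-fold scores.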
Table 4 shows the results for all of the algorithms mentioned above. In terms of accuracy, our model BERT-FC (BERT with a fully
connected network) gave the best score, 0.992050, followed by the CNN algorithm with an accuracy of 0.983015. For precision,
Random Forest was the best with a value of 0.998540, and BERT-FC came second with a precision of 0.983513. For recall, BERT-FC
has the best score with a value of 0.930546, followed by SVM with a value of 0.885477. For the F1-Score, BERT-FC is once again the best
with a score of 0.955831, against 0.905362 for the CNN algorithm. For the AUC measurement, our model gave the best score with a value
of 0.964403, and Multinomial Naive Bayes is second with a score of 0.930068. Finally, for the MCC measure, BERT-FC again gives the best
score with a value of 0.952174. In conclusion, the BERT-FC model has shown its effectiveness by giving the best result in five of the six
evaluation measures.

TABLE 3 Hyperparameters used in the experimental evaluation

Hyperparameter        Value
Max sequence length   64
Batch size            32
Optimizer             AdamW
Learning rate         5e−5
Number of epochs      3

TABLE 4 Results of the classification algorithms

Text representation  Classification model     Accuracy  Precision  Recall    F1-Score  ROC AUC   MCC
TF-IDF               Support Vector Machine   0.980005  0.971796   0.885477  0.813267  0.905364  0.878774
                     k-nearest neighbors      0.905444  0.000000   0.000000  0.000000  0.500000  0.000000
                     Multinomial Naive Bayes  0.978318  0.898278   0.883838  0.870383  0.930068  0.872179
                     Decision Tree            0.968201  0.857547   0.824319  0.794450  0.890379  0.807858
                     Logistic Regression      0.961334  0.926966   0.759106  0.642793  0.818731  0.753503
                     Random Forest            0.977476  0.998540   0.864986  0.763371  0.881618  0.862250
                     AdaBoost                 0.970489  0.920974   0.826416  0.750636  0.871999  0.815957
                     Bagging classifier       0.971212  0.894177   0.836824  0.787736  0.889018  0.823542
                     Extra Trees              0.979643  0.975584   0.881965  0.805484  0.901678  0.875989
Word embedding       CNN                      0.983015  0.950589   0.864559  0.905362  0.929950  0.897367
                     LSTM                     0.980727  0.929419   0.862148  0.894415  0.927608  0.884630
BERT                 BERT-FC                  0.992050  0.983513   0.930546  0.955831  0.964403  0.952174
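
The ranking discussed for Table 4 can be checked mechanically. The sketch below copies the values exactly as printed in Table 4 and picks the top model per measure; `TABLE_4`, `METRICS`, and `best_per_metric` are illustrative names, not part of SM-Detector.

```python
METRICS = ("accuracy", "precision", "recall", "f1", "auc", "mcc")

# Values copied as printed in Table 4, in the column order of METRICS.
TABLE_4 = {
    "Support Vector Machine": (0.980005, 0.971796, 0.885477, 0.813267, 0.905364, 0.878774),
    "k-nearest neighbors": (0.905444, 0.000000, 0.000000, 0.000000, 0.500000, 0.000000),
    "Multinomial Naive Bayes": (0.978318, 0.898278, 0.883838, 0.870383, 0.930068, 0.872179),
    "Decision Tree": (0.968201, 0.857547, 0.824319, 0.794450, 0.890379, 0.807858),
    "Logistic Regression": (0.961334, 0.926966, 0.759106, 0.642793, 0.818731, 0.753503),
    "Random Forest": (0.977476, 0.998540, 0.864986, 0.763371, 0.881618, 0.862250),
    "AdaBoost": (0.970489, 0.920974, 0.826416, 0.750636, 0.871999, 0.815957),
    "Bagging classifier": (0.971212, 0.894177, 0.836824, 0.787736, 0.889018, 0.823542),
    "Extra Trees": (0.979643, 0.975584, 0.881965, 0.805484, 0.901678, 0.875989),
    "CNN": (0.983015, 0.950589, 0.864559, 0.905362, 0.929950, 0.897367),
    "LSTM": (0.980727, 0.929419, 0.862148, 0.894415, 0.927608, 0.884630),
    "BERT-FC": (0.992050, 0.983513, 0.930546, 0.955831, 0.964403, 0.952174),
}


def best_per_metric(table):
    """Map each measure to the model with the highest printed value."""
    best = {}
    for column, metric in enumerate(METRICS):
        best[metric] = max(table, key=lambda model: table[model][column])
    return best


winners = best_per_metric(TABLE_4)
for metric, model in winners.items():
    print(f"{metric}: {model}")
print(sum(model == "BERT-FC" for model in winners.values()), "of", len(METRICS))
```

On the printed values, BERT-FC ranks first on every measure except precision, where Random Forest leads.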

5.4.2 Results of all detection methods

In this section, we evaluate the contribution of the other two detection methods (URL inspection and regular expression analysis). For this
purpose, we divided the dataset into two parts: 80% for training and 20% for testing. We started by applying only the BERT-based classification.
Then, we included the URL inspection and the regular expression analysis. Table 5 compares the BERT-based model
alone with the hybrid model combining the three methods. The results show that combining the three methods improved
the detection performance on all six comparison criteria: Accuracy, Precision, Recall, F1-Score, AUC, and MCC. This improvement is due
to the identification of three new spam messages by the first two methods, which could not be detected by the BERT-based classification. In
Figures 5 and 6, we present the confusion matrix and the ROC curve of the classification model and of the overall detection model.
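
The effect of adding the first two methods can be illustrated with a small sketch. It assumes that a message is flagged as spam when at least one of the three methods flags it; this section does not spell out the combination rule, but the assumption is consistent with the statement that the first two methods caught spam the classifier missed. The function names and the toy labels below are illustrative and are not taken from the dataset.

```python
def combine_any(*method_flags):
    """Flag a message when at least one method flags it (assumed OR rule)."""
    return [any(flags) for flags in zip(*method_flags)]


def recall(y_true, y_pred):
    """Recall as in Equation 6, computed from boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn)


# Toy data: four spam messages followed by two legitimate ones.
y_true      = [True,  True,  True,  True,  False, False]
url_flags   = [True,  False, True,  False, False, False]
regex_flags = [False, False, True,  False, False, False]
bert_flags  = [True,  True,  False, True,  False, False]

overall = combine_any(url_flags, regex_flags, bert_flags)
print(recall(y_true, bert_flags), recall(y_true, overall))
```

Under an OR rule the combined recall can never be lower than that of the classifier alone, while any false positive raised by a single method is also carried into the final decision.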

5.5 Comparative analysis

In Table 6, we compare the results of our model with the previous results described in the related work section. This comparison shows that our
model outperforms all other models in terms of accuracy. This is mainly due to the use of the new BERT representation model which has shown its
superiority over the old text representation methods such as Word Embedding and TF-IDF.

5.6 Discussion

TABLE 5 Comparison between BERT-FC model and the overall detection model

Model                    Accuracy  Precision  Recall    F1-Score  AUC       MCC
BERT-FC model            0.994582  0.987013   0.955975  0.971246  0.977322  0.968406
Overall detection model  0.996388  0.987261   0.974843  0.981013  0.986756  0.979041

FIGURE 5 Confusion Matrix and ROC curve of BERT-FC

FIGURE 6 Confusion Matrix and ROC curve of the overall detection model

TABLE 6 Comparison with previous works

Paper reference    Used methods                                                                                    Accuracy
Roy et al.30       CNN and LSTM                                                                                    99.44%
Mishra et al.24    Naive Bayes                                                                                     96.29%
Joo et al.25       Naive Bayes                                                                                     –
Jain et al.28      Rule-based classification                                                                       99% of true negative and 92% of true positive
Sonowal et al.5    Random forest, Decision Tree, Support Vector Machine, and AdaBoost                              96.4%
Jain et al.27      Feature-based technique, Random forest, Naïve Bayes, Support Vector Machine and Neural Network  98.74%
Almeida et al.29   Lexicography, semantic dictionaries and techniques of semantic analysis and disambiguation      –
Arifin et al.26    FP-growth and Naive Bayes                                                                       98.5%
Ghourabi et al.31  CNN and LSTM                                                                                    98.37%
Our model          BERT and fully connected neural network                                                         99.63%

The hybrid model proposed in this article has shown its effectiveness by giving the best result compared with other classification
algorithms and other related works. It achieved an accuracy of 99.63% in a mixed environment of Arabic and English messages. The obtained
results show that a good combination of the three different methods (URL inspection, regular expression analysis, and BERT-based classification),
as we have done in this article, is a very promising way to create efficient systems for text message classification and spam detection.
The strength of our system is that it is based on a multilingual BERT model, which allows it to support languages other than English and
Arabic. Moreover, the proposed approach can also be applied to the messaging services of social networks (such as Facebook, Instagram, and
WhatsApp), since they are exposed to the same dangers as the SMS service.
However, there are certain limitations regarding our proposed approach. The experimental dataset is not very large, which may prevent the
detection of other types of smishing messages. To improve the performance of our model and deploy it on a large scale, we need to work on expanding
the training dataset by enriching it with other types of text messages.

5.7 Examples of mobile application interfaces

The mobile application is the means of communication between the user and our system “SM-Detector.” Through the mobile
application, users can check their text messages for smishing attempts. For example, in Figure 7(A), we present a text message received
by the mobile application. By clicking on the button “Check Message,” the user can send the message to “SM-Detector” to check whether it is spam.
In Figure 7(B), we show the response of the verification. It is a spam message attempting to steal the user’s Apple ID through a phishing URL. This
message was flagged as spam by all three detection methods used in “SM-Detector.”
In Figure 8, we present another example of a spam message, detected by two methods: regular expression analysis and BERT-based classification.
It is a message trying to deceive the user by informing him that he won £1000 cash. The regular expression module identified the presence of several
suspicious words in the message such as “winner,” “you have been selected,” “£,” “cash,” “award,” and “call.”
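
A minimal sketch of this kind of keyword matching is given below. The pattern list contains only the six expressions quoted above; the names `SUSPICIOUS_PATTERNS` and `find_suspicious_terms` and the sample message are hypothetical, and the actual rules of the regular expression module may differ.

```python
import re

# Only the expressions quoted in the text; the real rule set is not reproduced here.
SUSPICIOUS_PATTERNS = [
    r"\bwinner\b",
    r"\byou have been selected\b",
    r"£",
    r"\bcash\b",
    r"\baward\b",
    r"\bcall\b",
]
SUSPICIOUS_RE = re.compile("|".join(SUSPICIOUS_PATTERNS), re.IGNORECASE)


def find_suspicious_terms(message):
    """Return the distinct suspicious expressions found, lowercased and sorted."""
    return sorted({match.group(0).lower() for match in SUSPICIOUS_RE.finditer(message)})


# Hypothetical message written for this example.
sample = "WINNER! You have been selected to receive a £1000 cash award. Call now to claim."
print(find_suspicious_terms(sample))
print(find_suspicious_terms("See you at lunch tomorrow"))
```

Word boundaries (`\b`) keep a pattern such as “call” from matching inside longer words like “recall,” which limits false alarms on ordinary messages.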

FIGURE 7 Malicious message attempting to steal the user’s Apple ID detected by SM-Detector

FIGURE 8 Malicious message detected by SM-Detector trying to deceive the user by informing him that he won £1000 cash

6 CONCLUSION

In this article, a security model based on the pretrained BERT model for smishing message detection was presented. The model is called
“SM-Detector.” Its goal is to combine different analysis methods in order to identify malicious content in text messages. “SM-Detector” employs
three methods: (i) identification of malicious URLs, (ii) identification of suspected words and expressions, and (iii) classification of text messages. The
classification algorithm used in this model combines a BERT model with a fully connected neural network. The choice of
this classification method was justified by an experimental comparison with other machine learning algorithms. In addition, a mobile application
associated with “SM-Detector” was developed. The application provides a user-friendly interface allowing users to check received SMS messages. Experimental
results showed that our security model achieved an accuracy of 99.63%, a precision of 98.72%, a recall of 97.48%, an F1-Score of 98.1%, an AUC of
98.67%, and an MCC of 97.9%. The obtained results show that “SM-Detector” can significantly improve the security of smartphones by minimizing
the risks related to smishing attacks in mobile environments.
The proposed solution has been tested on messages sent through the SMS service. Nevertheless, it can also be applied to other services in
mobile devices such as social media messaging (WhatsApp, Facebook messenger) and emails. As future work, we plan to improve the accuracy of
“SM-Detector” and extend its functionality to support social media messaging and emails.

DATA AVAILABILITY STATEMENT


The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID
Abdallah Ghourabi https://orcid.org/0000-0001-5628-9016

REFERENCES
1. Godwin N. SMS usage statistics in 2019. market predictions for 2020-2023 and beyond; 2020. https://www.smseagle.eu/2020/04/06/sms-usage-statistics-in-2019-market-predictions-for-2020-2023-and-beyond/. Accessed July 15, 2020.
2. GSMA The mobile economy; 2020. https://www.gsma.com/mobileeconomy/wp-content/uploads/2020/03/GSMA_MobileEconomy2020_Global.pdf. Accessed July 15, 2020.
3. Gilbert N. 10 mobile marketing trends for 2020/2021; 2020. https://financesonline.com/mobile-marketing-trends/. Accessed July 15, 2020.
4. Flowroute Flowroute survey finds consumers overwhelmingly prefer SMS to email and voice for business interactions; 2016. https://www.flowroute.com/press-type/flowroute-survey-finds-consumers-overwhelmingly-prefer-sms-to-email-and-voice-for-business-interactions/. Accessed July 15, 2020.
5. Sonowal G, Kuppusamy KS. SmiDCA: an anti-smishing model with machine learning approach. Comput J. 2018;61(8):1143-1157. https://doi.org/10.1093/comjnl/bxy039
6. Alberti C, Lee K, Collins M. A BERT baseline for the natural questions. ArXiv; 2019:abs/1901.08634.
7. Chadha R, Bert QA. Attention on Steroids; 2019. https://github.com/ankit-ai/BertQA-Attention-on-Steroids.
8. Du Y, Pei B, Zhao X, Ji J. Deep scaled dot-product attention based domain adaptation model for biomedical question answering. Methods. 2020;173:69-74. https://doi.org/10.1016/j.ymeth.2019.06.024
9. Luo D, Su J, Yu S. A BERT-based approach with relation-aware attention for knowledge base question answering. Paper presented at: Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN). Glasgow, UK; 2020:1-8.
10. Zhang Y, Xu G, Wang Y, et al. A question answering-based framework for one-step event argument extraction. IEEE Access. 2020;8:65420-65431. https://doi.org/10.1109/ACCESS.2020.2985126
11. Colón-Ruiz C, Segura-Bedmar I. Comparing deep learning architectures for sentiment analysis on drug reviews. J Biomed Inform. 2020;103539.
12. Abu Farha I, Magdy W. A comparative study of effective approaches for Arabic sentiment analysis. Inf Process Manag. 2021;58(2):102438. https://doi.org/10.1016/j.ipm.2020.102438
13. Ray B, Garain A, Sarkar R. An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews. Appl Soft Comput. 2021;98:106935. https://doi.org/10.1016/j.asoc.2020.106935
14. Du C, Sun H, Wang J, Qi Q, Liao J. Adversarial and domain-aware BERT for cross-domain sentiment analysis. Paper presented at: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:4019-4028.
15. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. NAACL-HLT; 2019.
16. Adhikari A, Ram A, Tang R, Lin J. DocBERT: BERT for document classification. ArXiv; 2019:abs/1904.08398.
17. Croce D, Castellucci G, Basili R. GAN-BERT: generative adversarial learning for robust text classification with a bunch of labeled examples. Paper presented at: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:2114-2119.
18. Yu S, Su J, Luo D. Improving BERT-based text classification with auxiliary sentence and domain knowledge. IEEE Access. 2019;7:176600-176612. https://doi.org/10.1109/ACCESS.2019.2953990
19. Dong Y, Liu P, Zhu Z, Wang Q, Zhang Q. A fusion model-based label embedding and self-interaction attention for text classification. IEEE Access. 2020;8:30548-30559. https://doi.org/10.1109/ACCESS.2019.2954985
20. Sergio GC, Lee M. Stacked DeBERT: all attention in incomplete data for text classification. Neural Netw. 2020. https://doi.org/10.1016/j.neunet.2020.12.018
21. Yang J, Wang M, Zhou H, et al. Towards making the most of BERT in neural machine translation. AAAI; 2020.
22. Lample G, Conneau A. Cross-lingual language model pretraining. NeurIPS; 2019.

23. Zhu J, Xia Y, Wu L, et al. Incorporating BERT into neural machine translation. ArXiv; 2020:abs/2002.06823.
24. Mishra S, Soni D. Smishing detector: a security model to detect smishing through sms content analysis and URL behavior analysis. Future Generat Comput Syst. 2020;108:803-815. https://doi.org/10.1016/j.future.2020.03.021
25. Joo JW, Moon SY, Singh S, Park JH. S-Detector: an enhanced security model for detecting Smishing attack for mobile computing. Telecommun Syst. 2017;66(1):29-38. https://doi.org/10.1007/s11235-016-0269-9
26. Arifin DD, Bijaksana MA. Enhancing spam detection on mobile phone Short Message Service (SMS) performance using FP-growth and Naive Bayes classifier. Paper presented at: Proceedings of the 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). Bandung, Indonesia; 2016:80-84.
27. Jain AK, Gupta BB. Feature based approach for detection of smishing messages in the mobile environment. J Inf Technol Res. 2019;12(2):17-35. https://doi.org/10.4018/JITR.2019040102
28. Jain AK, Gupta B. Rule-based framework for detection of smishing messages in mobile environment. Proc Comput Sci. 2018;125:617-623. The 6th International Conference on Smart Computing and Communications. https://doi.org/10.1016/j.procs.2017.12.079
29. Almeida TA, Silva TP, Santos I, Hidalgo JMG. Text normalization and semantic indexing to enhance instant messaging and SMS spam filtering. Knowl Based Syst. 2016;108:25-32. New Avenues in Knowledge Bases for Natural Language Processing. https://doi.org/10.1016/j.knosys.2016.05.001
30. Roy PK, Singh JP, Banerjee S. Deep learning to filter SMS spam. Future Generat Comput Syst. 2020;102:524-533. https://doi.org/10.1016/j.future.2019.09.001
31. Ghourabi A, Mahmood MA, Alzubi QM. A hybrid CNN-LSTM model for SMS spam detection in Arabic and English messages. Future Internet. 2020;12(9):156. https://doi.org/10.3390/fi12090156
32. Almeida TA, Hidalgo JMG, Yamakami A. Contributions to the study of SMS spam filtering: new collection and results. Paper presented at: Proceedings of the 11th ACM Symposium on Document Engineering. Association for Computing Machinery; 2011:259-262; New York, NY.
33. Yadav K, Kumaraguru P, Goyal A, Gupta A, Naik V. SMSAssassin: crowdsourcing driven mobile-based system for SMS spam filtering. Paper presented at: Proceedings of the 12th Workshop on Mobile Computing Systems and Applications. Association for Computing Machinery; 2011:1-6; New York, NY.
34. Chen T, Kan MY. Creating a live, public short message service corpus: the NUS SMS corpus. Lang Resour Eval. 2013;47:299-355. https://doi.org/10.1007/s10579-012-9197-9
35. VirusTotal VirusTotal, how it works; 2020. https://support.virustotal.com/hc/en-us/articles/115002126889-How-it-works. Accessed July 15, 2020.
36. Carpinter J, Hunt R. Tightening the net: a review of current and next generation spam filtering tools. Comput Secur. 2006;25(8):566-578. https://doi.org/10.1016/j.cose.2006.06.001

How to cite this article: Ghourabi A. SM-Detector: A security model based on BERT to detect SMiShing messages in mobile environments.
Concurrency Computat Pract Exper. 2021;e6452. https://doi.org/10.1002/cpe.6452
