TJ 15 2021 1 112-120
https://doi.org/10.31803/tg-20210205101347
Umar Farooq
Abstract: SQL injection attacks are a serious threat to cyber security, particularly for the many web applications that reside on the internet. Many webpages accept sensitive information (e.g. usernames, passwords, bank details) from users and store it in a database that also resides on the internet. Although such online databases are important for remote access to information for various business purposes, attackers can gain unrestricted access to them or bypass authentication procedures with the help of an SQL injection attack. This attack can cause great damage and alteration to the database and has been ranked as the topmost security risk in the OWASP Top 10. Considering the difficulty of distinguishing unknown attacks with the current rule-matching technique, a strategy for SQL injection detection based on machine learning is proposed. Our aim is to detect this attack by splitting the queries into their corresponding tokens with the help of tokenization and then applying our algorithms to the tokenized dataset. We used four ensemble machine learning algorithms: Gradient Boosting Machine (GBM), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting Machine (XGBM), and Light Gradient Boosting Machine (LGBM). The results yielded by our models are near perfect, with an almost negligible error rate. The best results are yielded by LGBM, with an accuracy of 0.993371 and precision, recall, and f1 of 0.993373, 0.993371, and 0.993370, respectively. LGBM also yielded the lowest error rates, with a False Positive Rate (FPR) of 0.007 and a Root Mean Squared Error (RMSE) of 0.120761. The worst results are yielded by AdaBoost, with an accuracy of 0.991098 and precision, recall, and f1 of 0.990733, 0.989175, and 0.989942, respectively. AdaBoost also yielded a high FPR of 0.009.
Keywords: Boosting; ensemble learning; Light GBM; SQL injection; web security
In this section, we briefly describe all ten types of SQL injection attack.

2.1 Tautologies

The attacker uses a conditional query in which the ‘WHERE’ clause is injected so that the condition becomes a tautology that is always true. In the example “SELECT * FROM Users WHERE User-id = 1 or 1=1”, the query returns all the data in the database because the condition of the WHERE clause is always true. This can be prevented by restricting users from entering special characters such as single quotes, double quotes, equality signs, and other symbols that are used to build malicious queries [7].
Example: SELECT * FROM accountTable WHERE user login= or 1=1

2.2 Piggy-Backed Query

This type is used to retrieve data, modify the database, execute commands, and perform Denial of Service (DoS) attacks. In this attack, the attacker tries to inject additional malicious queries along with the normal/original query. The original query is valid and executes normally, while the additional malicious queries are injected without being checked. This can be prevented by disallowing the execution of multiple statements and checking for delimiters in all queries [7].
Example: SELECT * FROM accountTable WHERE user login=umar AND passwd=; drop accountTable user – AND pin=221

This type is used to bypass authentication and extract all data from the database. In this attack, the attacker inserts a [...]

This type is used to detect parameters that are vulnerable to injection and then extract data from the identified database. In this attack, the attacker tries to extract all information about the database and its structure. This can be prevented by verifying user inputs and avoiding the generation of error messages from the database [7].
Example: SELECT * FROM accountTable WHERE user login= ’umar”’ AND passwd =

2.6 Inference

This type is used to detect parameters that are vulnerable to injection and then extract data from the database once its schema is identified. This attack is launched on secured databases and is of two types: inference blind SQL injection and inference time SQL injection [7].
Example: 1; IF SYSTEM_USER='sa' SELECT 1/0 ELSE SELECT 5

2.7 Alternate Coding

This type is used to evade detection. In this attack, the attacker injects encoded text to bypass detection techniques with the help of signatures like EXEC(), Char(), ASCII(), BIN(), HEX(), UNHEX(), BASE64(), DEC(), ROT13(), etc. This can be prevented by verifying user inputs and prohibiting meta-characters [7].
Example: SELECT * FROM accountTable WHERE user login= ’umar’;exec(char(0x59842 352646f776e)) AND passwd =’farooq’ AND pin =; SHUTDOWN;–;

SELECT * FROM Accounts WHERE accountName = 'admin'--'AND password = ''
This statement logs the attacker in as the admin user [8].
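The comment-based bypass in the statement above can be reproduced on a throwaway database. The following is a minimal sketch, assuming an in-memory SQLite database; the table layout, the stored password, and the function names are hypothetical. It also shows a parameterized query, a commonly used mitigation that the text above does not itself describe.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one hypothetical admin account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Accounts (accountName TEXT, password TEXT)")
    conn.execute("INSERT INTO Accounts VALUES ('admin', 's3cret')")
    return conn


def login_vulnerable(conn, name, password):
    # User input is concatenated into the SQL text, so quotes and
    # comment markers inside it are parsed as SQL.
    sql = ("SELECT * FROM Accounts WHERE accountName = '" + name +
           "' AND password = '" + password + "'")
    return conn.execute(sql).fetchone() is not None


def login_parameterized(conn, name, password):
    # Placeholders keep the input as data; it is never parsed as SQL.
    sql = "SELECT * FROM Accounts WHERE accountName = ? AND password = ?"
    return conn.execute(sql, (name, password)).fetchone() is not None


conn = build_demo_db()
print(login_vulnerable(conn, "admin'--", ""))     # password check is commented out
print(login_parameterized(conn, "admin'--", ""))  # input is compared as a literal name
```

In the vulnerable variant, the injected `--` turns the password condition into a comment, so the query matches the admin row without a password; in the parameterized variant the same input is only compared against stored account names.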
2.9 Blind Injection

This type is used to ask Boolean (true/false) questions, and the information is extracted depending on the behavior of the web page. The web page functions normally if the injected condition is true; otherwise, the web page functions differently [8].

2.10 Timing Attacks

This type is used to derive information with the help of If-Then statements, where the attacker notes the timing delays of responses from the database [8].

Generally, SQL injection attacks are divided into three types depending on the mode of transfer of incoming and outgoing data: in-band, out-of-band, and inferential [9]. In an in-band SQL injection attack, the attacker extracts the information over the same channel that is used for sending the query or performing the attack. In an out-of-band SQL injection attack, the attacker extracts the information with the help of another channel, such as email. In an inferential SQL injection attack, the attacker does not extract the information over any channel but instead launches other attacks to analyze the behavior of the web application.

3 RELATED WORK

Multiple studies have been carried out so far in the field of SQL injection and its detection using various approaches such as static and dynamic analysis, combined techniques, machine learning, hash techniques, black-box testing, etc. [10].

Static analysis checks whether each flow from a source to a sink is subject to an input validation and/or input sanitizing routine [11], whereas dynamic analysis relies on dynamically mining the programmer's intended query structure for any input and detects attacks by comparing it against the structure of the actual query issued [12].

AMNESIA, as a combined approach, is a model-based technique that combines static and dynamic analysis for the detection and prevention of SQL injection attacks. It uses static analysis to build models of the SQL queries at the points where the database is accessed. It then uses dynamic analysis before the queries are sent to the database and compares them with the statically built models [10]. However, some queries and code-snippet generation approaches make this model less efficient, with a higher error rate [13].

A Hidden Markov Model (HMM) has been presented to detect malicious queries with the help of machine learning in two phases: a training phase and a running phase. The first phase focuses on collecting known malicious and benign queries, and the second phase focuses on detecting injection attacks. The author himself stated that WHERE-clause and piggybacked queries cannot be detected by this model [4].

Detection of SQL injection attacks based on the Naïve Bayes machine learning algorithm combined with a role-based access mechanism was proposed [14]. The detection rate of this model is 93%; however, future attacks cannot be detected with this data, and the classifier relies on labeled data.

4 METHODOLOGY

The main aim of the proposed model is to detect SQL injection attacks. The whole procedure is performed in four stages:
1) The first stage focuses on collecting a dataset that contains proper SQL injection attack queries. For this purpose, we created a dataset that contains SQL queries, SQL injection attack queries, and plain text. The labelling of the dataset is done in this stage.
2) The second stage deals with extracting all the features from all the queries and selecting the best of them (a.k.a. feature extraction and feature selection). Tokenization is used in this stage to divide the queries into tokens.
3) The third stage deals with training the model. The model is trained in this phase with 70% of the dataset (a.k.a. training part).
4) The fourth stage uses the remaining 30% of the dataset, which we separated from the collected dataset, for testing and evaluating the proposed model with the selected best feature set (a.k.a. testing part).

4.1 Dataset

The most important part of detecting an SQL injection attack is collecting a meaningful dataset that contains SQL injection attack queries. The main contribution of this paper is a labelled dataset that we manually collected for this problem. The dataset contains not only SQL injection attack queries but also normal SQL queries and plain text queries, so that the proposed model can properly comprehend and differentiate between normal and attacking SQL queries. The dataset is collected in three phases: 1) the normal SQL queries are collected in the first phase, 2) the SQL injection attack queries are collected in the second phase, and 3) the plain text is collected in the third phase. We collected these queries in text format, applied labelling and preprocessing methods to them, and then converted them to a csv file. We applied tokenization to the dataset and formed a new tokenized dataset. The dataset contains a total of 35198 queries with 21 features. The dataset has the following three categories:

4.1.1 Non-Malicious or Normal SQL Queries

These queries, non-malicious in nature, are used to create, maintain, and retrieve data from a database in the form of tables (relational database). The tokens (keywords) used in this type are: (rename, drop, delete, insert, create, exec, update, union, set, Alter, database, and, or, information_schema, load_file, select, shutdown, cmdshell, hex, ascii). The dangerous characters used in this type are: --, #, /*, ', '', ||, \\, =, /**/, @@.
4.1.2 SQL Injection Attack Queries/Malicious SQL Queries

These queries are used to execute malicious SQL statements in a web application and bypass its security measures. These queries are also used to add, modify, and delete records in a database in an unrestricted way. The tokens (keywords) used in this type are: , *, ; , _, -, (, ), =, {, }, @, ., , &, [, ], +, -, ?, %, !, :, \, /. The SQL tokens used are: where, table, like, select, update, and, or, set, like, in, having, values, into, alter, as, create, revoke, deny, convert, exec, concat, char, tuncat, ASCII, any, asc, desc, check, group by, order by, delete from, insert into, drop table, union, join.

4.1.3 Plain Text

These are simply plain text. The tokens (keywords) used in this type are letters and digits. Plain text is included in this dataset to make sure that the proposed model properly comprehends and differentiates between an SQL query, an SQL injection query, and the plain text that the user enters in the login node of any web app.

The detailed description of the collected dataset (features) is given below in Tabs. 1 and 2.
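Stages 3 and 4 of the methodology divide the labelled queries into a 70% training part and a 30% testing part. A minimal sketch of such a partition follows, assuming a simple shuffled split; the source does not state whether the split was shuffled, seeded, or stratified, and the function name, the seed, and the placeholder records are hypothetical.

```python
import random


def split_train_test(records, train_percent=70, seed=42):
    """Shuffle the records and cut them into training and testing parts.

    Integer arithmetic is used for the cut so that the sizes do not
    depend on floating-point rounding.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Placeholder (text, label) pairs standing in for the 35198 labelled queries;
# the labels here do not follow the real class ratios of the dataset.
records = [(f"query-{i}", i % 3) for i in range(35198)]
train, test = split_train_test(records)
print(len(train), len(test))
```

For 35198 records this cut gives 24638 training and 10560 testing records. The testing size equals the sum of the entries of the GBM confusion matrix in Table 12, which suggests the reported partition was of this kind.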
Table 2 Description of labels
S. No.  Label  Description                                     Count  Ratio
1       0      It represents the normal SQL queries            6888   19.57%
2       1      It represents the SQL injection attack queries  18369  52.19%
3       2      It represents the plain text                    9941   28.24%

4.2 Tokenization

The keywords used in SQL injection attacks are used to launch operations on the database tables. These keywords play an important role in launching SQL injection attack as [...] recognize the greater part of SQLIA types like redundancies/tautologies, union, piggybacked, illegal/logically incorrect, alternate encodings and stored procedures, which are dealt with the same as SQL queries.

Let us take the example of or 1=1 to understand the concept of tokenization. By applying tokenization to the above query, the output is given below and is in accordance with the features listed in Tab. 1:

or 1=1
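How a query such as or 1=1 can be turned into tokens and then into numeric features may be sketched as follows. This is an illustration only, not the authors' tokenizer: the keyword and symbol sets are small hypothetical subsets of the lists in Section 4.1, and the four feature names are invented for the example, since the 21 features of Tab. 1 are not reproduced in this part of the text.

```python
import re

# Hypothetical subsets of the token lists given in Section 4.1.
SQL_KEYWORDS = {"select", "union", "drop", "insert", "update",
                "delete", "where", "and", "or", "exec"}
SPECIAL_TOKENS = {"--", "#", "/*", "*/", "'", "=", ";", "||", "@@"}

# Multi-character symbols are listed first so they are not split apart.
TOKEN_PATTERN = re.compile(
    r"--|/\*|\*/|\|\||@@|[A-Za-z_]+|\d+|[^\sA-Za-z_\d]"
)


def tokenize(query):
    """Split a query into lowercase word, number, and symbol tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(query)]


def token_features(query):
    """Count token categories; the feature names are illustrative only."""
    tokens = tokenize(query)
    return {
        "keyword_count": sum(t in SQL_KEYWORDS for t in tokens),
        "special_count": sum(t in SPECIAL_TOKENS for t in tokens),
        "digit_count": sum(t.isdigit() for t in tokens),
        "token_count": len(tokens),
    }


print(tokenize("or 1=1"))
print(token_features("or 1=1"))
```

For or 1=1 the sketch produces the tokens or, 1, =, 1, that is, one keyword, one special character, and two numbers. A row of such counts per query is the kind of tokenized record a boosting classifier can be trained on.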
[...] perform over the testing data, we applied three- and five-fold cross-validation, in which we split the dataset into 3 and 5 parts, respectively. The advantage of cross-validation is that all the observations are used for both training and testing the models, and each observation is used for testing exactly once.

5 RESULTS AND DISCUSSION

From the experiments that we conducted, we conclude that our proposed system, with 21 features, is able to distinguish SQL injection attack queries from normal and plain text queries. We focused on including as many features as possible in order to make the proposed model robust and able to detect all types of SQL injection attack queries efficiently.

To evaluate the performance of our proposed model, we applied the algorithms, which are ensemble boosting in nature, to the testing data (30% of the original dataset). The classification results produced by the proposed model are near perfect and are shown in the tables and figures below.

We separated the results into different tables, where each table reports a different classification metric, such as accuracy (Acc.), precision (Pr.), recall (Re.), f1 score (f1), false positive rate (FPR), root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE), to analyze the behavior of our system properly. The results are shown in Tabs. 3-14.

Table 7 MAE report of our proposed model
Classifier  Partition Strategy                     MAE (3-CV)  MAE (5-CV)
GBM         Training Set = 70%, Testing Set = 30%  0.010321    0.011590
AdaBoost    Training Set = 70%, Testing Set = 30%  0.011553    0.011553
XGBoost     Training Set = 70%, Testing Set = 30%  0.011742    0.011742
Light GBM   Training Set = 70%, Testing Set = 30%  0.009280    0.009280

Table 8 MSE report of our proposed model
Classifier  Partition Strategy                     MSE (3-CV)  MSE (5-CV)
GBM         Training Set = 70%, Testing Set = 30%  0.014678    0.016590
AdaBoost    Training Set = 70%, Testing Set = 30%  0.016856    0.016856
XGBoost     Training Set = 70%, Testing Set = 30%  0.017992    0.017992
Light GBM   Training Set = 70%, Testing Set = 30%  0.014583    0.014583

Table 9 RMSE report of our proposed model
Classifier  Partition Strategy                     RMSE (3-CV)  RMSE (5-CV)
GBM         Training Set = 70%, Testing Set = 30%  0.121152     0.128805
AdaBoost    Training Set = 70%, Testing Set = 30%  0.129830     0.129830
XGBoost     Training Set = 70%, Testing Set = 30%  0.134135     0.134135
Light GBM   Training Set = 70%, Testing Set = 30%  0.120761     0.120761

Table 10 FPR report of our proposed model
Classifier  Partition Strategy                     FPR (3-CV)  FPR (5-CV)
GBM         Training Set = 70%, Testing Set = 30%  0.008       0.009
AdaBoost    Training Set = 70%, Testing Set = 30%  0.009       0.010
XGBoost     Training Set = 70%, Testing Set = 30%  0.008       0.008
Light GBM   Training Set = 70%, Testing Set = 30%  0.007       0.007
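The three error metrics in Tables 7 to 9 are related by simple definitions, which the sketch below spells out. It assumes that the errors are computed on the numeric class labels 0, 1, and 2 of Table 2, which the source does not state explicitly; the function names and the toy label vectors are hypothetical. The dictionary of reported values is copied from the 3-CV columns of Tables 8 and 9.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between two equally long label sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between two equally long label sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Toy labels: one plain-text sample (2) is predicted as normal SQL (0).
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 0, 2]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))

# (MSE, RMSE) pairs from the 3-CV columns of Tables 8 and 9.
REPORTED_3CV = {
    "GBM": (0.014678, 0.121152),
    "AdaBoost": (0.016856, 0.129830),
    "XGBoost": (0.017992, 0.134135),
    "Light GBM": (0.014583, 0.120761),
}


def rmse_matches(mse_value, rmse_value, tol=1e-5):
    """Check that a reported RMSE is the square root of the reported MSE."""
    return abs(math.sqrt(mse_value) - rmse_value) <= tol


print(all(rmse_matches(m, r) for m, r in REPORTED_3CV.values()))
```

Under this labelling, confusing class 2 with class 0 contributes an error of 2 (squared: 4), while confusing neighbouring labels contributes 1, so MAE and MSE depend on how the categories are numbered. The reported RMSE values are consistent with the square roots of the reported MSE values to within rounding.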
$$\mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{Specificity} \tag{8}$$
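For a three-class problem, Eq. (8) is usually applied one class at a time (one-vs-rest). The sketch below does this for the GBM confusion matrix reported in Table 12, assuming that rows are predicted labels and columns are actual labels, as the table's "Predicted" and "Actual" headings suggest. The function name is hypothetical, and the sketch is not an attempt to reproduce Table 10, whose averaging scheme the text does not state.

```python
# GBM confusion matrix from Table 12; matrix[p][a] is the number of
# test queries predicted as class p whose actual class is a (assumed layout).
GBM_MATRIX = [
    [2078, 12, 15],
    [3, 5461, 22],
    [8, 26, 2935],
]


def false_positive_rate(matrix, cls):
    """One-vs-rest FPR = FP / (FP + TN) for class `cls`."""
    n = len(matrix)
    # FP: predicted as `cls` although the actual class is different.
    fp = sum(matrix[cls][a] for a in range(n) if a != cls)
    # TN: neither predicted as `cls` nor actually `cls`.
    tn = sum(matrix[p][a]
             for p in range(n) for a in range(n)
             if p != cls and a != cls)
    return fp / (fp + tn)


for cls in range(3):
    print(cls, round(false_positive_rate(GBM_MATRIX, cls), 6))
```

For the attack class (label 1), FP = 3 + 22 = 25 and TN = 2078 + 15 + 8 + 2935 = 5036, so the one-vs-rest FPR is 25/5061.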
0    1966    12    21
1       7  5473    37
2       6    20  3018

Figure 2 Classification report

The error report, in graphical form, of our proposed system is given in Fig. 3.

Table 12 Confusion matrix of GBM
Predicted \ Actual     0     1     2
0                   2078    12    15
1                      3  5461    22
2                      8    26  2935

0    2060    21    37
1       7  5388    41
2       4    28  2974

0    2095     6    17
1       1  5418    17
2      11    18  2977

The classification reports evaluated by our four models are given in Fig. 4.
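Accuracy can be read directly off a confusion matrix as the sum of the diagonal divided by the sum of all entries, independent of whether rows or columns hold the predictions. The sketch below applies this to the four matrices above, in the order they appear. Only the second one is captioned in this part of the text (Table 12, GBM); the other dictionary keys and the function name are hypothetical.

```python
MATRICES = {
    "first": [[1966, 12, 21], [7, 5473, 37], [6, 20, 3018]],
    "table_12_gbm": [[2078, 12, 15], [3, 5461, 22], [8, 26, 2935]],
    "third": [[2060, 21, 37], [7, 5388, 41], [4, 28, 2974]],
    "fourth": [[2095, 6, 17], [1, 5418, 17], [11, 18, 2977]],
}


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for name, matrix in MATRICES.items():
    total = sum(sum(row) for row in matrix)
    print(name, total, round(accuracy(matrix), 6))
```

Each matrix sums to 10560 samples. The fourth matrix gives 10490/10560, about 0.993371, which is the value the abstract reports as the LGBM accuracy; since that matrix carries no caption here, the correspondence is an observation rather than a confirmed label.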
Figure 4 Classification report from GBM, AdaBoost, XGBM, and LGBM, respectively
Figure 5 Classification report from GBM, AdaBoost, XGBM, and LGBM, respectively (continuation)
Figure 6 ROC results from GBM, AdaBoost, XGBM, and LGBM, respectively
5.3 ROC Curves

The ROC values evaluated by our algorithms are given in Tab. 15.

Table 15 ROC values of our proposed models
Algorithms  GBM       AdaBoost  XGBoost   Light GBM
ROC Value   0.995449  0.997657  0.999548  0.999845

5.4 Comparative Analysis

[...] terms of accuracy. Our proposed model outperforms other existing models in terms of accuracy, with a lower error rate.

Table 16 Comparative analysis
Classifiers/Models                                 Accuracy
SVM, Naïve Bayes, GBM, REGEX [15]                  97%
Neural Network system [16]                         96.8%
Genetic-fuzzy rule-based system [17]               98.4%
SVM [18]                                           98%
K-means [19]                                       98.36%
Our Proposed model (GBM, AdaBoost, XGBM, LGBM)     99.34%
Author’s contact:
Umar Farooq,
Department of Computer Science & Technology (Cyber Security),
Central University of Punjab,
City Campus, Mansa Road, Bathinda 151001, Punjab, India