Machine Learning for SQL Injection Prevention
INTRODUCTION
Most of the applications that we use every day are web-based. Organizations make their
applications accessible over the Internet to increase their exposure, but exposure to the
Internet also brings the security challenges that come with uncontrolled access. With the growth
of the Internet, we have become used to performing many kinds of transactions online. All the
data that users enter during these transactions on web applications or websites is stored in
some kind of database. Relational databases are queried and modified through a language called
Structured Query Language (SQL). Using SQL to launch attacks on databases and manipulate them
into doing what the attacker wants is a web hacking technique called a SQL Injection Attack. SQL
Injection Attacks have become an increasing cause of worry for cyber defenders. A query that
contains user input can be malicious, because that input comes from an external source, so web
applications need to process user input before it is executed. It is the responsibility of the
programmer to write code that prevents such intrusion. Through negligence or ignorance,
however, user input is often left unprocessed, which gives an attacker room to intrude into the
system.
With their enormous growth in quantity and quality, web applications are changing the living and
work habits of people. Web applications are usually designed to interact with back-end databases,
and a number of web applications have been attacked through web vulnerabilities (Wang Xin, 2012).
According to the OWASP Top Ten 2010, which lists the top ten vulnerabilities in web applications,
the first security risk is SQL injection (OWASP, 2010). A SQL injection attack (SQLIA) generates
malicious queries in a web application that change the developer's intended functions when an
attacker supplies crafted input. Using such input, the attacker can retrieve important and
sensitive data from databases, such as administrators' passwords or customers' personal
information. If these data are made public, the owners and users of the web application may
suffer great damage. SQLIA is usually caused by input from clients. Although potentially harmful
characters in client request messages can be escaped, a number of conditions need to be
considered (Preecha Noiumkar, 2012). For instance, if an application treats special symbols such
as single quotes as legal in name strings but illegal in password strings, it cannot simply
forbid these symbols in all input and needs a more sophisticated method to filter input soundly.
In this paper, we design a system based on machine learning for preventing SQLI attacks. The
system captures HTTP requests to obtain input contents, classifies them with a Bayesian
classifier, and then detects malicious contents and terminates attacks. In addition, we created
a tool that generates training samples automatically by classifying and analyzing legitimate and
injection patterns found in the real world. We evaluated this learning-based method against
various types of injection patterns and verified its actual effect with a SQL injection attack
tool. Unlike previous approaches, our method is effective with a simple detection mechanism and
is independent of databases and web applications; in other words, any web application with any
database can be protected by the method without being modified.
Lately, the use of machine learning algorithms to detect and prevent various cyber security
threats has been widely debated. While the power of supervised and unsupervised learning
techniques to detect security threats is not in question, the computing resources and time
required to execute such complex algorithms remain a major concern for the cyber security
community. Considerable research has been done on using various machine learning algorithms to
detect SQL Injection attacks. No single machine learning algorithm or technique is best for
every problem. A problem needs to be tested against various algorithms from the classification
or regression families, and the results compared, before a particular approach is chosen for
maximum accuracy.
In this paper, the Naïve Bayes algorithm is used to detect and prevent SQL Injection attacks. We
begin with an introduction to SQL Injection attacks and the need to build a better SQL Injection
detection system. The related work done in this area so far is then described; the significant
implementations and research to date provide enough literature to learn from and improve on.
There is also an introduction to supervised learning, which is the generic approach we use to
solve this problem, together with a closer look at the algorithm considered for this experiment.
In the previous year alone, SQL Injection and Remote Code Execution attacks contributed to
more than four-fifths of the detected web-based attacks. SQL Injection attacks remain one of the
most pervasive cyber-attacks. Many techniques have been developed to deal with such attacks;
however, attackers still seem to get through the various defense mechanisms in place
(Chris Anley et al 2016).
1.2 MOTIVATION
Every day, millions of pieces of data are submitted by users through various channels on the
web, and user input can be malicious. Since web applications are easily accessible, they are
prone to many vulnerabilities which, if neglected, can cause harm. Attackers use these loopholes
to gain unauthorized access and perform various illegal activities. SQL Injection is one such
attack: it is easy to perform but difficult to detect because of its varied types and channels,
and it may result in theft, leakage of personal data or loss of property. A machine learning
approach is therefore required to combat these intrusion attempts effectively.
1.3 AIMS AND OBJECTIVES OF THE STUDY
In this work, we focus on creating a model that detects and prevents attacks carried out on
web-based applications. This is implemented using the Naïve Bayes algorithm, a probabilistic
classifier based on Bayes' theorem that treats the features of its input as independent of each
other.
CHAPTER TWO
LITERATURE REVIEW
2.1 INTRODUCTION
Many researchers have studied methods to detect and prevent SQL injection attacks. The most
preferred techniques are web frameworks, static analysis, dynamic analysis, combined static and
dynamic analysis, and machine learning. A web framework filters user input data; however,
because it can filter only some special characters, attacks that detour around the filter cannot
be prevented. Static analysis inspects computer code without actually executing the program
(C. Gould et al 2015). The main idea behind static analysis is to identify software defects
during the development phase. It is applied to find potential violations that match a
vulnerability pattern, so it is more effective than filtering, but attacks that use the correct
parameter types cannot be detected. The main limitation of the method is that it cannot detect
SQL injection attack patterns that are not known beforehand and explicitly described in the
specifications. Dynamic analysis can be seen as the next logical step after static analysis. It
inspects the behavior of a running system and does not require access to the internals of the
system; however, it is not able to detect all SQL injection attacks. A combined static and
dynamic analysis method can compensate for the weaknesses of each method and is highly
proficient in detecting SQL injection attacks (W.G. Halfond et al 2005). A machine learning
method built on a combined method can detect unknown attacks.
SQL Injection is an attack that tries to gain unauthorized access to a database by injecting
code into, and thereby exploiting, a SQL query (J. Abirami et al 2015). Let us understand this
through a simple example.
Say there is a banking website that lets users log in by entering their username and password.
When the user enters a valid username and password, the authentication passes and the user is
allowed to log in.
Fig. 1. Example login page in browser
The following query is constructed for an authorized login attempt where:
Username = usr
Password = usr123
SQL Query: SELECT * FROM users WHERE name = 'usr' and password = 'usr123'
However, a user with malicious intent may instead enter the following input:
Username = usr
Password = ' or '1' = '1
SQL Query: SELECT * FROM users WHERE name = 'usr' and password = '' or '1' = '1'
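The way such a query comes to be built can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite database, the sample rows and the function names (build_login_query, vulnerable_login, safe_login) are hypothetical and are not part of the system described in this report.

```python
import sqlite3


def build_login_query(username, password):
    # Vulnerable: user input is concatenated directly into the SQL text.
    return (
        "SELECT * FROM users WHERE name = '" + username
        + "' and password = '" + password + "'"
    )


def make_demo_db():
    # Hypothetical in-memory database with two sample accounts.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("usr", "usr123"), ("admin", "s3cret")],
    )
    return conn


def vulnerable_login(conn, username, password):
    # Executes the concatenated query; any returned row counts as a login.
    return conn.execute(build_login_query(username, password)).fetchall()


def safe_login(conn, username, password):
    # Parameterized query: the input is bound as data, never parsed as SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ? AND password = ?",
        (username, password),
    ).fetchall()
```

With the malicious password shown above, vulnerable_login returns every row of the users table, while safe_login returns no rows because the whole string is compared as a literal password.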
Since '1' = '1' is always true, the WHERE condition holds for every row and this user will
always be allowed to log in to the website. The user gains unauthorized access to someone else's
account details, and possession of this information could have serious consequences for the
person whose account information was stolen. This is a case of theft and a violation of data
privacy. The aim of attackers using SQL Injection is to exploit the database that is connected
to a website or web application. It is extremely important to protect such databases against SQL
Injection attacks in order to protect the important data stored in them. Letting an unauthorized
user gain access to a database can result in many unauthorized actions on it, such as deleting
tables or retrieving important information, and SQL Injection attacks make all of this possible.
SQL Injection Attacks can be broadly classified into the following three categories: Union
Based, Error Based and Blind.
In SQL, the UNION operator combines the result sets of two SELECT statements. Union Based SQL
Injection takes advantage of this feature to make the database return attacker-chosen results in
addition to the intended results. This is achieved by injecting a second query in place of plain
text, introduced by the UNION keyword.
A simple example would be searching for a song in a database. When we enter the name of the
song in the search field, the following query is formed.
SQL Query: SELECT * FROM songs WHERE name = 'magic'
However, a malicious user might enter input in the song search field that produces the
following query instead.
SQL Query: SELECT * FROM songs WHERE name = 'magic' UNION SELECT * FROM users
If both tables return the same number of columns, the result now includes every row of the
users table alongside the matching songs. Here the user runs two queries at one time and uses
the UNION keyword to combine them, so the second part of the query can read any data that the
database account is allowed to see. UNION itself can only append a SELECT; a destructive
statement such as DROP TABLE songs requires a stacked query, in which a second statement follows
a semicolon.
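A Union Based injection that reads from a second table can be sketched as follows. The in-memory SQLite database, the two-column table layouts, the sample rows and the function names are assumptions made only for this illustration.

```python
import sqlite3


def make_song_db():
    # Hypothetical schema: both tables have two columns, so UNION is valid.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songs (name TEXT, artist TEXT)")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO songs VALUES ('magic', 'some band')")
    conn.execute("INSERT INTO users VALUES ('usr', 'usr123')")
    return conn


def search_songs(conn, term):
    # Vulnerable: the search term is concatenated into the query text.
    query = "SELECT * FROM songs WHERE name = '" + term + "'"
    return conn.execute(query).fetchall()


# The trailing "--" comments out the quote that the application appends.
UNION_PAYLOAD = "magic' UNION SELECT name, password FROM users --"
```

Calling search_songs with UNION_PAYLOAD returns the matching song together with the contents of the users table, which is the extra result set that the attacker wanted.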
The Error Based SQL Injection approach works by passing invalid input in the query and thereby
triggering an error in the database. This is achieved by forcing the database to perform an
action that leads to an error. The user can then examine the errors generated by the database
and use them to learn how to further manipulate the database by exploiting the SQL query.
A Blind SQL Injection attack is a technique in which the malicious user asks the database
questions and decides on the further course of action based on the returned answers. This is the
most difficult type of SQL Injection attack, since no information is known about the database.
This approach is used when the database returns only generic errors like 'Syntax Error'. Blind
SQL Injection attacks are further classified into Boolean Based SQL
SQL injection is harmful, and the risks associated with it give attackers motivation to attack
the database. The main consequences of these vulnerabilities are attacks on the following
characteristics:
i. Authorization: Critical data stored in a vulnerable SQL database may be altered by a
successful SQLIA.
ii. Authentication: If there is no proper control on the input fields of the authentication
page, it may be possible to log in to a system as a normal user without knowing that user's
credentials.
iii. Confidentiality: Databases usually contain sensitive data such as personal information,
credit card numbers and/or social security numbers. Therefore, loss of confidentiality is a
serious problem with SQL Injection vulnerabilities.
iv. Integrity: Through a successful SQLIA an attacker can not only read sensitive information
but also change or delete this private information.
v. Database Fingerprinting: The attacker can determine the type of database used in the back
end so that he can use database-specific attacks that correspond to weaknesses in a particular
database management system.
Attackers are constantly probing the Internet at-large and campus web sites for SQL injection
vulnerabilities. They use tools that automate the discovery of SQL injection flaws, and attempt
to exploit SQL injection primarily for financial gain (e.g. stealing personally identifiable
information which is then used for identity theft).
Because so many modern applications are data-driven and accessible via the web, SQL Injection
vulnerabilities are widespread and easily exploited. Additionally, because of the prevalence of
shared database infrastructure, a SQL Injection flaw in one application can lead to the
compromise of other applications sharing the same database instance.
Cryptographic Approach
This study described the types of SQL injection attack and proposed a runtime approach that
uses a cryptographic hash function to prevent attackers from bypassing login authentication
(Mihir Gandhi et al 2013). The idea was to prevent database engines from processing statements
like 'OR 1=1--' that attackers use to bypass the authentication mechanism. Relational database
engines evaluate a condition such as ' OR '1'='1'-- placed after the WHERE keyword as always
true, which gives attackers the opportunity to gain access to the system without proper
authentication and authorization. In this method, when users try to log in, the credentials are
converted into the corresponding hash value and compared with the hash value stored in the
database, which was created when the user first registered. When an attacker tries to log in
with muhd' OR '1'='1'-- as his credential, the system automatically denies access, because the
hash values of two different credentials are practically never the same (a collision is
extremely rare, although practical collisions are known for weak hash functions such as MD5).
The importance of this method is that the hashed credentials are very difficult to reverse, even
if an attacker manages to gain access to the database table.
This method does not prevent database fingerprinting attacks, because the approach does not
really detect any malicious value; it only transforms the value into a different format so that
it cannot fool the relational database engine. The only difference is that the hash function is
applied to encrypted credentials, and the comparison is made between hashed encrypted
credentials produced dynamically at runtime. The problem with this approach is that it is time
consuming: the user must wait for the system to compute the hash value of the credentials before
gaining access. The advantage of this approach is that it adds security to user credentials, so
that even if a user account is compromised the attacker cannot deduce anything, since valuable
information is not stored in clear text.
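The hash comparison at the heart of this approach can be sketched as below. This is a minimal illustration, not the cited authors' implementation: SHA-256, the in-memory STORED_HASHES dictionary and the function names are assumptions chosen for the example.

```python
import hashlib
import hmac


def credential_hash(username, password):
    # Hash the credential pair; SHA-256 is an illustrative choice.
    data = (username + ":" + password).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


# Hypothetical store of hashes created when each user first registered.
STORED_HASHES = {"usr": credential_hash("usr", "usr123")}


def login(username, password):
    # The submitted text is only ever hashed and compared, never run as SQL.
    stored = STORED_HASHES.get(username)
    if stored is None:
        return False
    return hmac.compare_digest(stored, credential_hash(username, password))
```

A tautology such as ' OR '1'='1'-- simply hashes to a value that does not match the stored one, so login returns False. A production system would normally use a salted, deliberately slow password hash rather than a single SHA-256 pass.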
XML Approach
This approach proposed a hybrid method to detect and prevent SQLIA, in which XML is used to
authenticate system users (R. Joseph Manoj et al 2014). The use of XML here is similar in
concept to the cryptographic approach: relational database engines are not allowed to directly
process queries containing a tautology attack (OR 1=1--), whose result always evaluates to true.
When users try to log in, the XML file maker intercepts the user query and converts it into XML
format, which is then sent to the XML file. The XML file then passes the user credentials to the
Xschema validator, which compares the user query produced by the XML maker with the user query
threshold already defined as a legitimate query in the Xschema file. If the dynamic query
matches the query defined in the Xschema file, the user is allowed to connect to the system;
otherwise access is denied and an error is returned to the user. This method has a number of
drawbacks. First, storing data for processing in XML format is not a good idea because
processing is very slow, as each node of the tree has to be visited. Second, using predefined
patterns in a protection mechanism is limiting, as users can inject values that the programmers
have not addressed in the user threshold, which results in a successful attack. Third, blocking
SQL keywords, operators and clauses is not effective.
In this approach, functions were created that retrieve user credentials from the database and
temporarily store them in an XML file for authentication. If the user credentials are found to
be equal to those in the XML file, the user is allowed to connect to the system; otherwise they
are blocked. Using XML to authenticate is not a good idea for two reasons. First, it causes
delays in generating the file and authenticating users. Second, attackers can inject XML to
obtain users' credentials. XML was also used to filter suspicious characters, operators and SQL
keywords (Indrani Bal Sundaram et al 2011). The advantage of using XML is that after XPath
authentication user credentials are encrypted to prevent password sniffing. Also, in terms of
database storage, this method uses a copy of the login table only for authentication, so that in
case of a successful attack on that copy the original login table remains unaffected.
This study proposed a method that uses a pattern matching algorithm to detect and prevent SQL
injection attacks. The method requires users to define a set of attack-related queries that are
compared with the dynamic queries issued by the user (M. Amutha et al 2013). In this case user
queries are not transformed into XML format or broken into SQL keywords, operators or
characters; instead, the dynamic SQL query entered by the user is compared with those defined in
the anomaly detection pattern. If the anomaly score of the dynamic query is equal to the one
defined by the user threshold, the query is rejected, and if the anomaly score is higher than
the threshold, an alarm is sent to the administrator to analyze the query manually. If the query
is found to be a new attack, it is added to the anomaly detection library. The drawback of this
method is that queries have to be analyzed manually by the administrator and manually added to
the anomaly detection library.
Parsing Approach
In this method a query model (SQLstatementsafe) was created as a library that contains the
syntax grammar of SQL statements (Narayanan et al 2011). This syntax grammar was built from two
perspectives, one for single queries and one for stacked queries. It also contains the tree
structure of SQL queries. When a user issues a SQL query from the website URL, the query is
first checked against SQLstatementsafe to see whether it is a single query and conforms to the
syntax of a legitimate SQL statement. If the query is single and conforms to the defined syntax,
it is allowed to be processed by the database engine; if it is single and does not conform, it
is rejected. If the query is found to be a stacked query, it is passed to a parse tree
comparison that checks the action in each query. If both actions are found to be legitimate, the
query is allowed; otherwise it is blocked, indicating that the user has modified one of the
stacked queries. The problem with this method is that it allows an attacker to perform database
fingerprinting by executing an inference attack. The advantage of this method is that there is
no need to add a URL when a new page is created.
This study provided an overview of SQL injection attack types and proposed methods in which the
static queries of web applications are built in the form of a tree-like structure. The
assumption was that all legitimate queries have the same syntactic structure (Ravi. Kumar et
al). If the query entered by the user does not match the structure of the defined query, the
query is blocked. In addition, the query model uses filter layers to prevent database
fingerprinting. Using filters to prevent database fingerprinting is more effective than using
stored procedures, as attackers can exploit stored procedures to perform malicious actions in
the database. It also proposed a method that uses a parse tree structure to determine the
legitimate structure of a SQL statement. In this method the query model needs to be constructed
in the form of a tree-like structure and stored in the form of a library, which is used for
later comparison when users submit a query for processing (Shi, Cong-cong et al 2012). This
method requires manual intervention if the anomaly detected is higher than the anomaly detected
by the approach.
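The idea of comparing the structure of a runtime query with a stored legitimate structure can be sketched as follows. This is a much-simplified stand-in for a real parse tree: it compares flat token skeletons in which literals are replaced by placeholders. The regular expression and the function names are assumptions made for this illustration and do not reproduce any of the cited tools.

```python
import re

# Order matters: quoted strings first, then numbers, words, single symbols.
TOKEN_RE = re.compile(r"'(?:[^']|'')*'|\d+|\w+|[^\s\w]")


def skeleton(query):
    # Reduce a query to its structure by replacing literals with placeholders.
    out = []
    for tok in TOKEN_RE.findall(query):
        if len(tok) >= 2 and tok.startswith("'") and tok.endswith("'"):
            out.append("<str>")
        elif tok.isdigit():
            out.append("<num>")
        else:
            out.append(tok.upper())
    return out


def same_structure(template_query, runtime_query):
    # A runtime query is accepted only if its skeleton matches the template.
    return skeleton(template_query) == skeleton(runtime_query)
```

For the login example, a query with ordinary credentials has the same skeleton as the template, whereas the tautology adds an OR clause and an extra comparison, so same_structure returns False.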
A number of articles were reviewed and some information was gathered from web sites to gain
sufficient knowledge about SQL injection attacks. The following are the papers that cover
important strategies for preventing SQL injection attacks.
1. “Using Parse Tree Validation to Prevent SQL Injection Attacks”, ACM. This paper covered
techniques for SQL injection discovery and described the SQL parse tree validation mentioned in
this report very well (Gregory T. Buehrer, Bruce W. Weide, and Paolo A. G. Sivilotti, 2005).
2. “The Essence of Command Injection Attacks in Web Applications”, ACM. This paper covered
techniques to check and sanitize input queries using SQLCHECK, which uses augmented queries and
the SQLCHECK grammar to validate a query (Zhendong Su and Gary Wassermann, 2006).
3. “Using Automated Fix Generation to Secure SQL Statements”, IEEE CNF. This paper covered brief
background, SQL statements, and vulnerability replacement methods (Stephen Thomas and Laurie
Williams, 2007).
5. “Preventing SQL Injection Attacks in Stored Procedures”. This paper provided a novel approach
to shield stored procedures from attack and detect SQL injection from sites (Ke Wei, M.
Muthuprasanna, Suraj Kothari, 2007). The method combines a runtime check with static application
code analysis in order to eliminate vulnerability to attack. The key idea is that an attack
alters the structure of the original SQL statement, which is how the SQL injection attack is
identified. The method is divided into two phases, one offline and one at runtime. In the
offline phase, a parser pre-processes the stored procedures and detects SQL statements in the
execution call for runtime analysis. In the runtime phase, the technique checks all SQL queries
generated at runtime from user input against the original structure of the SQL statement. Once
the technique detects malicious SQL statements, it prevents these statements from accessing the
database and provides details about the attack.
6. With the development of AI and machine learning, (A. Joshi et al 2014) proposed using machine
learning algorithms to prevent SQL Injection attacks. Their paper detects SQL Injection attacks
using a machine learning algorithm called Naïve Bayes. Naïve Bayes is a classification algorithm
that assumes that a particular feature is unrelated to and independent of all other features. In
the paper a Naïve Bayes classifier is used to classify SQL queries as malicious or
non-malicious. To train the model they used a training dataset that consists of both malicious
and non-malicious SQL queries, in which every query is labelled. Labelling the data helps the
model to learn what is malicious and what is non-malicious. This type of model is called a
supervised machine learning model. Once the model has been trained, it is used on the test
dataset to verify that it classifies SQL queries correctly. The model suggested in the paper
promises to detect even those SQL Injection attacks that are new and whose signatures are not
known. The present work also uses machine learning to detect SQL Injection attacks.
SUPERVISED LEARNING
Machine learning algorithms can be broadly classified as supervised learning algorithms and
unsupervised learning algorithms. Supervised learning, in its simplest form, works in the
following manner. We have a dataset called the training dataset, and each individual component
of this dataset is labelled. The supervised learning model learns the relationship between the
data and the labels and then uses this learnt information to classify new data that it has never
seen before. This new data is called the test dataset. We use the test dataset to determine the
accuracy of a supervised learning algorithm. This is how we predict values for, or classify,
never-before-seen data using supervised machine learning. Supervised learning algorithms can
further be broadly classified as regression algorithms and classification algorithms.
Fig. 3 Supervised Machine Learning
Regression algorithms are used for predicting a value for an individual data component, for
example, predicting the value of a house or the price of a stock. The values predicted by
regression algorithms are usually quantitative or numerical. Classification algorithms are used
for classifying individual data components, for example, deciding whether a vehicle is a truck
or a car, or predicting whether it will rain on a given day. Classification algorithms are used
to predict qualitative values. Figure 1 shows the derived hierarchy of classification and
regression algorithms. In this section we focus on classification algorithms in machine
learning, and more specifically on the Naïve Bayes classification model.
1. CRYPTOGRAPHIC APPROACH
The problem with this approach is that it is time consuming. The user must wait for the system
to compute the hash value of the credentials before being given access to the system.
2. XML APPROACH
Storing data for processing in XML format is not a good idea because the processing speed is
very slow, as each node of the tree has to be visited.
3. PARSING APPROACH
CHAPTER THREE
METHODOLOGY
In this project, we want to determine whether a query is malicious or non-malicious. To decide
this, we use the supervised Naïve Bayes algorithm. In the learning process, the application
reads the training dataset from text files and passes each record to the learning method of the
classifier. The classifier generates feature vectors from the received data by blank separation
and tokenizing, and learns from them by the machine learning method. We are using an MVC
framework.
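The step from raw input to feature vector can be illustrated with a short sketch. It assumes that "blank separation" means splitting on whitespace and that the feature vector holds token counts; the function names and the lower-casing are assumptions for illustration, not the project's actual code.

```python
from collections import Counter


def tokenize(text):
    # Blank separation: lower-case the input and split on whitespace.
    return text.lower().split()


def build_vocabulary(samples):
    # Map every distinct token in the training samples to a vector index.
    tokens = sorted({tok for sample in samples for tok in tokenize(sample)})
    return {tok: index for index, tok in enumerate(tokens)}


def feature_vector(text, vocab):
    # Count how often each known token occurs; unknown tokens are ignored.
    vector = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vector[vocab[tok]] = count
    return vector
```

Each training record then becomes a fixed-length list of counts that the classifier can learn from, whatever the length of the original input.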
Naïve Bayes is a classification model in supervised learning that is based on Bayes' theorem.
The essence of Naïve Bayes is that it assumes that the presence of a feature in the data is
unrelated to the presence of other features. In short, it assumes that all the features of the
data are conditionally independent of each other, hence the name ‘Naïve Bayes’.
From the literature review it was found that none of the methods can detect all kinds of SQL
injection; each method is particularly suited to specific types of injection. After the
literature review, a conceptual framework was developed for a method that can offer better
results and detect more kinds of injection simultaneously. The proposed methodology returns
“YES” (binary '1') for a non-malicious query or “NO” (binary '0') for a malicious query. The
proposed Naïve Bayes algorithm can be described as follows:
I. Calculate, for each attribute, the probability of it being malicious or non-malicious.
II. Find the number of malicious queries in the training set = Nm
III. Find the number of non-malicious queries in the training set = Nnm
IV. Find the total number of training queries = Tq (Tq = Nm + Nnm)
For each query, do the following:
a. Lqm = product of the probabilities of all its attributes being malicious
b. Lqnm = product of the probabilities of all its attributes being non-malicious
c. Pprior(M) = Nm / Tq
d. Pprior(NM) = Nnm / Tq
V. Ppost(M) = Pprior(M) * Lqm
VI. Ppost(NM) = Pprior(NM) * Lqnm
VII. The query is declared malicious or non-malicious depending on which Ppost is larger.
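The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: attributes are taken to be whitespace-separated tokens, add-one (Laplace) smoothing is added so that an unseen token does not force a probability of zero, and all function names are hypothetical.

```python
from collections import Counter


def tokenize(query):
    # Attributes are assumed to be lower-cased, whitespace-separated tokens.
    return query.lower().split()


def train(samples):
    # samples: iterable of (query, is_malicious) pairs.
    counts = {True: Counter(), False: Counter()}
    n_queries = {True: 0, False: 0}  # Nm and Nnm
    for query, malicious in samples:
        n_queries[malicious] += 1
        counts[malicious].update(tokenize(query))
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_queries, vocab


def posterior(query, label, model):
    counts, n_queries, vocab = model
    tq = n_queries[True] + n_queries[False]  # Tq
    prior = n_queries[label] / tq            # Pprior(M) or Pprior(NM)
    total = sum(counts[label].values())
    likelihood = 1.0                         # Lqm or Lqnm
    for tok in tokenize(query):
        # Add-one smoothing is an assumption, not part of the listed steps.
        likelihood *= (counts[label][tok] + 1) / (total + len(vocab))
    return prior * likelihood                # Ppost(M) or Ppost(NM)


def is_malicious(query, model):
    # Step VII: compare the two unnormalized posteriors.
    return posterior(query, True, model) > posterior(query, False, model)
```

For long inputs the product of many small probabilities can underflow, so a practical version would usually sum logarithms instead of multiplying.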
Bayes' theorem, on which the classifier is based, is:
P (A|B) = P (B|A) * P (A) / P (B)
where
P (A|B) = Probability of A being true given that B is true.
P (B|A) = Probability of B being true given that A is true.
P (A) = Probability of A regardless of the other data.
P (B) = Probability of B regardless of the other data.
There are two types of probabilities in this equation: prior probabilities, that is P(A) and
P(B), and posterior probabilities, that is P(A|B) and P(B|A). P(A|B) and P(B|A) are also
called conditional probabilities, since each is conditional on another event.
The benefits of using Naïve Bayes model could be many including the following:
1. It can be trained on a small dataset.
2. It is easier to compute and requires less computational resources.
Figure 6: system flowchart for detecting sql injection using the algorithm above
CHAPTER FOUR
4.1 IMPLEMENTATION
The implementation phase of this project is concerned with planning the structure of the system
that detects SQL injection. It covers hardware specifications and requirements and the choice of
software for the development of the web-based detection system.
This entails the creation of the server back-end database, tables and queries, and the creation
of the user front-end graphical user interface needed for the prompt functioning of the system.
The programming language and database system chosen are PHP (Hypertext Preprocessor) and MySQL.
The main reason for choosing PHP is that, as the internet grows, the use of cloud computing has
increased tremendously over the last few years. The system allows different users in different
locations to access a single object at the same time while the integrity of the system is
maintained. HTML, CSS and JavaScript were also used to format the pages and add improved
functionality to the whole project.
4.3.1 Software Requirement: The following software is needed for the smooth running of the new
system:
1. Windows (XP, Vista, 7, 8 or 10), Linux or macOS.
2. Web browser (Google Chrome, Mozilla Firefox, Internet Explorer, Safari, etc.)
3. XAMPP or Docker server.
4. Sublime Text, VS Code, Notepad++, etc.
5. phpMyAdmin for manipulating the database tables.
4.3.2 Hardware Requirement: The software listed above will work with the specifications
listed below, since a computer is not complete without both software and hardware.
The user interfaces of the new system, designed to be user-friendly, attractive and
functional, are explained after the login page of the SQL injection detection/prevention system.
Figure 7: Application Start-up page
Figure 8: Each time the application is run, the login page is displayed, where the user
enters a username and password for authentication. If the username and password exist,
the application takes the user to the page that matches those credentials.
Figure 10: Invalid login page
Figure 11b: Injection logs
Figure 13: User Table
This dataset consists of plain-text sentences and has around four thousand rows. The plain-text
dataset was created from payloads received from HTML forms. It consists of a combination of
URLs, special characters, textual data and numerical data. Gathering a dataset for this problem
was challenging, as no publicly accessible datasets of actual SQL injection attacks are
available. The SQL injection dataset was created with a tool named Libinjection (Nick
Galbreath, 2012). Libinjection is an open-source tool that is used for penetration testing of
web applications. It passes SQL injections as payloads to web applications and analyses whether
the application is vulnerable to SQL injection attack. Using this tool, all the payloads
generated by Libinjection for a particular instance were captured, and the resulting collection
is used as the SQL injection dataset. This dataset contains around six thousand SQL injections
of all three types: union-based, error-based and blind SQL injection.
Figure 17: Sample plain text data set
Figure 19: Identifying union-based SQL injection
Figure 24: Line of code that shows the accuracy of the prediction
Tokenization
In machine learning analysis of text-based datasets, tokenization is usually the first and
most important step in data pre-processing. In tokenization, a sequence of characters is
broken down into small pieces called 'tokens'. Tokenization sometimes also involves removing
certain characters. This practice is usually performed in word-based learning.
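As an illustration of this step, the following Python sketch splits an input string into tokens. The token pattern (identifiers, runs of digits, or single punctuation characters) and the lower-casing are assumptions chosen for the example. The report does not specify the exact tokenization rules used in the project.

```python
import re

# Assumed token pattern: an identifier-like word, a run of digits,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def tokenize(text):
    """Break a string into lower-cased tokens; whitespace is discarded."""
    return [tok.lower() for tok in TOKEN_PATTERN.findall(text)]


print(tokenize("1' OR '1'='1"))
# ['1', "'", 'or', "'", '1', "'", '=', "'", '1']
```

Keeping punctuation such as quotes and the equals sign as separate tokens matters here, because those characters carry much of the signal that separates an injection payload from ordinary text.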
Experiments
An approach similar to the one above is used to implement the Naïve Bayes algorithm. Tokens are
created and grouped together based on their occurrence.
Where: P(sqli) is the probability of SQL injection and P(plain) is the probability of plain text.
Step 2: Next, we calculate the likelihood of a new input being a SQL injection or plain text.
In this case the likelihood is similar to the G-test score and is based on the number of
tokens that match the new input. The likelihood is calculated in the following manner.
Match_Token_Cnt = total number of tokens that match the user input
Where:
Likelihood(sqli) = the likelihood that the new input is a SQL injection
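The token-matching idea in Step 2 can be sketched as follows. The report does not reproduce the exact likelihood formula, so the normalisation used here (matched tokens divided by the number of input tokens) is an assumption made only to obtain a value between 0 and 1. The function names and the sample token lists are likewise hypothetical.

```python
def match_token_count(input_tokens, class_tokens):
    """Match_Token_Cnt: number of input tokens that also occur in the class's tokens."""
    known = set(class_tokens)
    return sum(1 for tok in input_tokens if tok in known)


def likelihood(input_tokens, class_tokens):
    """Assumed normalisation: matched tokens / total input tokens."""
    if not input_tokens:
        return 0.0
    return match_token_count(input_tokens, class_tokens) / len(input_tokens)


# Hypothetical token sets learned from each class of training data.
sqli_tokens = ["'", "or", "union", "select", "=", "--"]
plain_tokens = ["hello", "my", "name", "is", "select"]

new_input = ["'", "or", "'", "1", "=", "1"]
print(match_token_count(new_input, sqli_tokens))   # 4
print(match_token_count(new_input, plain_tokens))  # 0
```

With a likelihood for each class in hand, the class whose prior multiplied by its likelihood is larger would be chosen, as in the algorithm steps given earlier.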
Algorithm                 Accuracy (%)
Naïve Bayes Classifier    92.8
Figure 25: Naïve Bayes classifier result
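The report does not show how the accuracy figure was computed. Assuming it is ordinary classification accuracy (correct predictions divided by all predictions on a labelled test set), a minimal sketch looks like this. The function name and the sample labels are illustrative and are not taken from the project.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Made-up labels for illustration only.
y_true = ["sqli", "plain", "sqli", "plain"]
y_pred = ["sqli", "plain", "plain", "plain"]
print(accuracy(y_true, y_pred))  # 75.0
```

Accuracy alone does not show whether errors are missed attacks or false alarms, so a per-class breakdown of the same predictions would be a natural complement.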
CHAPTER FIVE
5.1 SUMMARY
The first chapter is a general introduction to the project, specifying its goals and objectives
and establishing a baseline for the successful completion of the project. Chapter two focuses on
the background and relevant theories of the problem area, the related works and the
methodologies used in the project. The third chapter is concerned with the system design. The
different aspects of SQL injection were analyzed, and a description of the design and database
was documented.
The fourth chapter consists mainly of the system screenshots, system specifications and
implementation results. SQL injection attacks remain one of the top concerns for cyber
security researchers. Signature-based SQL injection detection methods are no longer reliable, as
attackers use new types of SQL injection each time. There is a need for SQL injection
detection mechanisms that are capable of identifying new, never-before-seen attacks.
5.2 CONCLUSION
In this thesis, the SQL injection detection problem is approached by applying machine learning.
A classification method is used to classify incoming traffic as either SQL injection or
plain text. One machine learning classification algorithm, the Naïve Bayes classifier, is
implemented for the problem, and this model achieves an accuracy of 92.8%. From this project it
can be concluded that machine learning approaches can be used for SQL injection detection.
5.3 RECOMMENDATIONS
The system should be audited at intervals in line with best practices, and full
implementation is strongly advised.
The Naïve Bayes algorithm should be reviewed from time to time to guard against new types of
attack or injection.
The dataset for the machine learning model should be updated at regular intervals to keep the
system safe.
REFERENCES
C. Gould, Zhendong Su and P. Devanbu, "JDBC checker: a static analysis tool for
Indrani Balasundaram, E. Ramaraj (2011). "An Approach to Detect and Prevent SQL Injection
Attacks in Database Using Web Service." IJCSNS International Journal of
Computer Science and Network Security 11.1, 95-100.
J. Abirami, R. Devakunchari and C. Valliyammai, 2015 "A top web security vulnerability
Mihir Gandhi, Jwalant Baria (2013). "SQL Injection Attacks in Web Application." International
Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 2,
Issue 6.
R. Joseph Manoj et al. (2014). "An Approach to Detect and Prevent Tautology Type SQL Injection
in Web Service Based on XSchema Validation." International Journal of
Engineering and Computer Science, ISSN: 2319-7242, Volume 3, Issue 1.
Shi, Cong-cong, et al. (2012). "A New Approach for SQL-Injection Detection." Instrumentation,
Measurement, Circuits and Systems. Springer Berlin Heidelberg, 245-254.
Natarajan K & Subramani S (2012). "Generation of SQL-injection free secure algorithm to detect
and prevent SQL-injection attacks." Procedia Technology 4, Elsevier
Ltd, pp. 790-796.
APPENDIX
buildscript {
ext.kotlin_version = '1.3.50'
repositories {
google()
jcenter()
}
}
}
rootProject.buildDir = '../build'
subprojects {
project.buildDir = "${rootProject.buildDir}/${project.name}"
}
subprojects {
project.evaluationDependsOn(':app')
}
task clean(type: Delete) {
delete rootProject.buildDir
}
include
':app'
pluginsFile.withReader('UTF-8') { reader -> plugins.load(reader) }
}
plugins.each { name, path ->
def pluginDirectory = flutterProjectRoot.resolve(path).resolve('android').toFile()
include ":$name"
project(":$name").projectDir = pluginDirectory
}
B8E99CEC8791B020B0BAE001 /* Pods-
Runner.debug.xcconfig */,
5C05B83D9C75CD0548B7F03A /* Pods-
Runner.release.xcconfig */,
A1FC04FBA59BC2144EDE67AE /* Pods-
Runner.profile.xcconfig */,
);
name = Pods;
path = Pods;
sourceTree = "<group>";
};
9740EEB11CF90186004384FC /* Flutter */ = {
isa = PBXGroup;
children = (
3B80C3931E831B6300D905FE /* App.framework */,
3B3967151E833CAA004F5970 /*
AppFrameworkInfo.plist */,
9740EEBA1CF902C7004384FC /* Flutter.framework
*/,
9740EEB21CF90195004384FC /* Debug.xcconfig */,
7AFA3C8E1D35360C0083082E /* Release.xcconfig
*/,
9740EEB31CF90195004384FC /* Generated.xcconfig
*/,
);
name = Flutter;
sourceTree = "<group>";
97C146EB1CF9000F007C117D /* Frameworks */,
97C146EC1CF9000F007C117D /* Resources */,
9705A1C41CF9048500538489 /* Embed Frameworks
*/,
3B06AD1E1E4923F5004D2608 /* Thin Binary */,
ED84FD48B78BF5D0C89ADA96 /* [CP] Embed
Pods Frameworks */,
);
buildRules = (
);
dependencies = (
);
name = Runner;
productName = Runner;
productReference = 97C146EE1CF9000F007C117D /*
Runner.app */;
productType = "com.apple.product-type.application";
};
/* End PBXNativeTarget section */
/* Begin PBXProject section */
97C146E61CF9000F007C117D /* Project object */ = {
isa = PBXProject;
attributes = {
LastUpgradeCheck = 1020;
ORGANIZATIONNAME = "The Chromium Authors";
TargetAttributes = {
97C146ED1CF9000F007C117D = {
CreatedOnToolsVersion = 7.3.1;
LastSwiftMigration = 1100;
};
};
};
buildConfigurationList = 97C146E91CF9000F007C117D /*
Build configuration list for PBXProject "Runner" */;
compatibilityVersion = "Xcode 3.2";
developmentRegion = en;
hasScannedForEncodings = 0;
knownRegions = (
en,
Base,
);
mainGroup = 97C146E51CF9000F007C117D;
productRefGroup = 97C146EF1CF9000F007C117D /*
Products */;
projectDirPath = "";
projectRoot = "";
targets = (
97C146ED1CF9000F007C117D /* Runner */,
);
};
/* End PBXProject section */
/* Begin PBXResourcesBuildPhase section */
97C146EC1CF9000F007C117D /* Resources */ = {
isa = PBXResourcesBuildPhase;
buildActionMask = 2147483647;
files = (
97C147011CF9000F007C117D /*
LaunchScreen.storyboard in Resources */,
3B3967161E833CAA004F5970 /*
AppFrameworkInfo.plist in Resources */,
97C146FE1CF9000F007C117D /* Assets.xcassets in
Resources */,
97C146FC1CF9000F007C117D /* Main.storyboard in
Resources */,
);
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXResourcesBuildPhase section */
/* Begin PBXShellScriptBuildPhase section */
3B06AD1E1E4923F5004D2608 /* Thin Binary */ = {
isa = PBXShellScriptBuildPhase;
buildActionMask = 2147483647;
files = (
);
inputPaths = (
);
name = "Thin Binary";
outputPaths = (
);
runOnlyForDeploymentPostprocessing = 0;
shellPath = /bin/sh;
shellScript = "/bin/sh
\"$FLUTTER_ROOT/packages/flutter_tools/bin/xcode_backend.sh\" thin";
};
9740EEB61CF901F6004384FC /* Run Script */ = {
isa = PBXShellScriptBuildPhase;
buildActionMask = 2147483647;
files = (
);
inputPaths = (
);
name = "Run Script";
outputPaths = (
);
runOnlyForDeploymentPostprocessing = 0;
shellPath = /bin/sh;
shellScript = "/bin/sh
\"$FLUTTER_ROOT/packages/flutter_tools/bin/xcode_backend.sh\" build";
};
D47FECD597310191CB10CF43 /* [CP] Check Pods Manifest.lock */
={
isa = PBXShellScriptBuildPhase;
buildActionMask = 2147483647;
files = (
);
inputFileListPaths = (
);
inputPaths = (
"${PODS_PODFILE_DIR_PATH}/Podfile.lock",
"${PODS_ROOT}/Manifest.lock",
);
name = "[CP] Check Pods Manifest.lock";
outputFileListPaths = (
);
outputPaths = (
"$(DERIVED_FILE_DIR)/Pods-Runner-
checkManifestLockResult.txt",
);
runOnlyForDeploymentPostprocessing = 0;
shellPath = /bin/sh;
shellScript = "diff
\"${PODS_PODFILE_DIR_PATH}/Podfile.lock\" \"${PODS_ROOT}/Manifest.lock\"
> /dev/null\nif [ $? != 0 ] ; then\n # print error to STDERR\n echo \"error: The
sandbox is not in sync with the Podfile.lock. Run 'pod install' or update your CocoaPods
installation.\" >&2\n exit 1\nfi\n# This output is used by Xcode 'outputs' to avoid re-
running this script phase.\necho \"SUCCESS\" > \"${SCRIPT_OUTPUT_FILE_0}\"\n";
showEnvVarsInLog = 0;
};
ED84FD48B78BF5D0C89ADA96 /* [CP] Embed Pods Frameworks */
={
isa = PBXShellScriptBuildPhase;
buildActionMask = 2147483647;
files = (
);
inputPaths = (
);
name = "[CP] Embed Pods Frameworks";
outputPaths = (
);
runOnlyForDeploymentPostprocessing = 0;
shellPath = /bin/sh;
shellScript = "\"${PODS_ROOT}/Target Support Files/Pods-
Runner/Pods-Runner-frameworks.sh\"\n";
showEnvVarsInLog = 0;
};
/* End PBXShellScriptBuildPhase section */
/* Begin PBXSourcesBuildPhase section */
97C146EA1CF9000F007C117D /* Sources */ = {
isa = PBXSourcesBuildPhase;
buildActionMask = 2147483647;
files = (
74858FAF1ED2DC5600515810 /* AppDelegate.swift
in Sources */,
1498D2341E8E89220040F4C2 /*
GeneratedPluginRegistrant.m in Sources */,
);
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXSourcesBuildPhase section */
/* Begin PBXVariantGroup section */
97C146FA1CF9000F007C117D /* Main.storyboard */ = {
isa = PBXVariantGroup;
children = (
97C146FB1CF9000F007C117D /* Base */,
);
name = Main.storyboard;
sourceTree = "<group>";
};
97C146FF1CF9000F007C117D /* LaunchScreen.storyboard */ = {
isa = PBXVariantGroup;
children = (
97C147001CF9000F007C117D /* Base */,
);
name = LaunchScreen.storyboard;
sourceTree = "<group>";
};
/* End PBXVariantGroup section */
/* Begin XCBuildConfiguration section */
249021D3217E4FDB00AE95B9 /* Profile */ = {
isa = XCBuildConfiguration;
baseConfigurationReference =
7AFA3C8E1D35360C0083082E /* Release.xcconfig */;
buildSettings = {
ALWAYS_SEARCH_USER_PATHS = NO;
CLANG_ANALYZER_NONNULL = YES;
CLANG_CXX_LANGUAGE_STANDARD = "gnu+
+0x";
CLANG_CXX_LIBRARY = "libc++";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION =
YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE =
YES_ERROR;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF
= YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION =
YES;
CLANG_WARN_OBJC_ROOT_CLASS =
YES_ERROR;
CLANG_WARN_RANGE_LOOP_ANALYSIS =
YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH
= YES;
"CODE_SIGN_IDENTITY[sdk=iphoneos*]" =
"iPhone Developer";
COPY_PHASE_STRIP = NO;
DEBUG_INFORMATION_FORMAT = "dwarf-with-
dsym";
ENABLE_NS_ASSERTIONS = NO;
ENABLE_STRICT_OBJC_MSGSEND = YES;
GCC_C_LANGUAGE_STANDARD = gnu99;
GCC_NO_COMMON_BLOCKS = YES;
GCC_WARN_64_TO_32_BIT_CONVERSION =
YES;
GCC_WARN_ABOUT_RETURN_TYPE =
YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS =
YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
IPHONEOS_DEPLOYMENT_TARGET = 8.0;
MTL_ENABLE_DEBUG_INFO = NO;
SDKROOT = iphoneos;
SUPPORTED_PLATFORMS = iphoneos;
TARGETED_DEVICE_FAMILY = "1,2";
VALIDATE_PRODUCT = YES;
};
name = Profile;
};
249021D4217E4FDB00AE95B9 /* Profile */ = {
isa = XCBuildConfiguration;
baseConfigurationReference =
7AFA3C8E1D35360C0083082E /* Release.xcconfig */;
buildSettings = {
ASSETCATALOG_COMPILER_APPICON_NAME =
AppIcon;
CLANG_ENABLE_MODULES = YES;
CURRENT_PROJECT_VERSION = "$
(FLUTTER_BUILD_NUMBER)";
ENABLE_BITCODE = NO;
FRAMEWORK_SEARCH_PATHS = (
"$(inherited)",
"$(PROJECT_DIR)/Flutter",
);
INFOPLIST_FILE = Runner/Info.plist;
);
defaultConfigurationIsVisible = 0;
defaultConfigurationName = Release;
};
rootObject = 97C146E61CF9000F007C117D /* Project object */;
}