0% found this document useful (0 votes)

80 views10 pages

Detection of Attack-Targeted Scans From The Apache HTTP Server Access Logs

This paper proposes a method to detect attack-targeted scans from Apache HTTP Server access logs. The method analyzes log files to distinguish normal visits, crawlers/bots, and potential attacks such as SQL injection and XSS attacks. The approach was tested on sample and real log data, and detection performance was analyzed and compared to other common techniques. Further development suggestions are also discussed.

Uploaded by

s_samit

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

80 views10 pages

Detection of Attack-Targeted Scans From The Apache HTTP Server Access Logs

Uploaded by

s_samit

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/316738947

Detection of attack-targeted scans from the Apache HTTP Server access logs

Article in Applied Computing and Informatics · April 2017

DOI: 10.1016/j.aci.2017.04.002

CITATIONS READS

12 1,911

4 authors, including:

Ferhat Ozgur Catak Ensar Gul

Norwegian University of Science and Technology Istanbul Sehir University
34 PUBLICATIONS 122 CITATIONS 30 PUBLICATIONS 95 CITATIONS

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Cyber Security of Cyber-Physical Systems and IoT View project

Machine Learning Based Cyber Security View project

All content following this page was uploaded by Ferhat Ozgur Catak on 06 November 2017.

The user has requested enhancement of the downloaded file.

Applied Computing and Informatics xxx (2017) xxx–xxx

Contents lists available at ScienceDirect

Applied Computing and Informatics

journal homepage: www.sciencedirect.com

Detection of attack-targeted scans from the Apache HTTP Server access

logs
Merve Basß Seyyar a,⇑, Ferhat Özgür Çatak b, Ensar Gül a
a_ _
Istanbul Sßehir University, Istanbul, Turkey
b
Tubitak/Bilgem, Kocaeli, Turkey

a r t i c l e i n f o a b s t r a c t

Article history: A web application could be visited for different purposes. It is possible for a web site to be visited by a
Received 8 January 2017 regular user as a normal (natural) visit, to be viewed by crawlers, bots, spiders, etc. for indexing purposes,
Revised 25 April 2017 lastly to be exploratory scanned by malicious users prior to an attack. An attack targeted web scan can be
Accepted 26 April 2017
viewed as a phase of a potential attack and can lead to more attack detection as compared to traditional
Available online xxxx
detection methods. In this work, we propose a method to detect attack-oriented scans and to distinguish
them from other types of visits. In this context, we use access log files of Apache (or ISS) web servers and
Keywords:
try to determine attack situations through examination of the past data. In addition to web scan detec-
Rule-based model
Log analysis
tions, we insert a rule set to detect SQL Injection and XSS attacks. Our approach has been applied on sam-
Scan detection ple data sets and results have been analyzed in terms of performance measures to compare our method
Web application security and other commonly used detection techniques. Furthermore, various tests have been made on log sam-
XSS detection ples from real systems. Lastly, several suggestions about further development have been also discussed.
SQLI detection Ó 2017 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an
open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction would lead to threats either directly or indirectly [1]. Because, an

overlooked vulnerability scan may result in a large scale problems
The dependency to web systems; consisting of web services and such as information leakage and privacy violation. As a matter of
web applications, is growing with time. From health sector to elec- fact, the detection of these malicious scans becomes very crucial
tronic commerce (e-commerce), internet usage is needed in all to prevent web applications from exploitation and to take effective
areas of life. Due to incremental utilization of cloud technology; countermeasures almost immediately.
there is no doubt that this dependency will increase even more. According to European Network and Information Security
However, web environment is hosting billions of users including Agency (ENISA) Threat Landscape 2015 (ETL 2015) [2], web based
malicious ones like script kiddies and cyber terrorists. Malicious and web applications attacks are ranked as number two and three
users misuse highly efficient automated scan tools to detect vul- in cyber-threat environment, and their rankings have remained
nerabilities in web applications. Obtaining diagnostic information unchanged between 2014 and 2015. Since web security related
about web applications and specific technologies thanks to these threats have been perpetually evolving, web applications are more
tools is known as ‘‘Reconnaissance” in penetration testing method- disposed to security risks [3]. Also, attack methods to web applica-
ologies and standards like The Penetration Testing Execution Stan- tions are very diverse and their trends continue for a long time. For
dard (PTES). This information gathering phase is the first phase of instance, although Structured Query Language (SQL) injection and
all attacks just before exploitation because security vulnerabilities Cross-Site Scripting (XSS) seem to be at a decreasing rate in 2014,
an increase in their exposures is seen in 2015. Therefore, one may
easily deduce that web systems are in the focus of cyber criminals.
⇑ Corresponding author.
To detect all of mentioned attacks and scans, analyzing the log
E-mail addresses: [email protected] (M. Basß Seyyar), ozgur.catak@-
tubitak.gov.tr (F.Özgür Çatak), [email protected] (E. Gül). files is usually preferred, because anomalies in users’ requests and
Peer review under responsibility of King Saud University. related server responses could be clearly identified. Two primary
reasons for this preference are that log files are easily available,
and there is no need for expensive hardware for analysis [4]. In
addition, logs may provide successful detection especially for
Production and hosting by Elsevier encrypted protocols such as Secure Sockets Layer (SSL) and Secure

http://dx.doi.org/10.1016/j.aci.2017.04.002
2210-8327/Ó 2017 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
2 M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx

Shell Daemon (SSHD) [5]. However, the heavier the website’s traf- the values of some important HTTP header fields, Uniform
fic is, the more difficult the examination of the log files gets. There- Resource Identifier (URI) strings, cookies, etc. Associating WAF,
fore, the need for an user-friendly web vulnerability scan detection Intrusion Detection System (IDS), rule engine reasoning together
tool by analyzing log files seems pretty obvious. makes this article interesting.
Therefore, the objectives of this study can be summarized as Goseva-Popstojanova et al. [8] propose a method to classify
follows: malicious web sessions through web server logs. Firstly, the
authors constitute four different data sets from honeypots; on
to detect vulnerability scans. which several web applications were installed. Afterwards, 43 dif-
to detect XSS and SQLI attacks. ferent features were extracted from web sessions to characterize
to examine access log files for detections. each session and three machine learning methods that are Support
Vector Machine (SVM), J48 and Partial Decision Trees (PART) were
Accordingly, the contributions of the work can be expressed as used to make the classifications. The authors assert that when all
follows: 43 features used in learning period, their method to distinguish
between attack and vulnerability scan sessions attains high accu-
The motivation of the relevant work is quite different, typically racy rates with low probability of false alarms. This comprehensive
focusing on machine-learning based predictive detection of research provides significant contribution in the area of web
malicious activities. Actually, all machine learning algorithms security.
have training phase and training data to built a classification Different from log analysis, Husák et al. [9] analyze extended
model. In order to increase accuracy of machine learning classi- network flow and parse HTTP requests. In addition to some Open
fier model, a large scale input training data is needed. In turn, an Systems Interconnection (OSI) Layer 3 and Layer 4 data, the
increase in memory consumption would occur. As a result, extracted HTTP information from network flow includes host
either the model would turn out to be not trainable, or training name, path, user agent, request method, response code, referrer
phase would last for days. On the other hand, executing the pro- and content type fields. To group network flow in three classes
posed rule set on access logs does not cause any memory con- such as repeated requests, HTTP scans, and web crawlers; source
sumption problems. Our script simply runs on Ubuntu Internet Protocol (IP), destination IP, and requested Uniform
terminal with a single line of code. Resource Locator (URL) split into domain and path are used. One
Another negative aspect of focusing on machine learning is of the interesting results they obtain is that the paths requested
overfitting; referring to a model that models the training data for HTTP scans are also requested for brute-force attack as well.
too well. Using a very complex models may result in overfitting However, not only HTTP requests but also HTTP responds should
that may negatively influence the model’s predictive perfor- also be analyzed to get more effective results.
mance and generalization ability [6]. Nevertheless, we design After a learning period of non-malicious HTTP logs, Zolotukhin
our rules to operate on past data which allows a detailed anal- et al. [10] analyze HTTP requests in an on-line mode to detect net-
ysis of a user’s actions [4] so that the complexity of our work intrusions. Normal user behavior, anomalies related features
approach is not too high. and intrusions detection are extracted from web resources, queries
The proposed model addresses the detection of web vulnerabil- attributes and user agent values respectively. The authors compare
ity scans on web applications by analyzing log files retrieved five different anomaly-detection methods; that are Support Vector
from web servers. Since most of the web servers log HTTP Data Description (SVDD), K-means, Density-Based Spatial Cluster-
requests by default, data is easily available to be analyzed. Thus, ing of Applications with Noise (DBSCAN), Self-Organizing Map
any extra configuration, installation, purchase or data format (SOM) and Local Outlier Factor (LOF), according to their accuracy
modification are not needed. Furthermore, our analysis is based rates in detecting intrusions. It is asserted that simulations results
upon rule-based detection strategy and we built our rule set on show higher accuracy rates compared to the other data-mining
several features of log entries. As opposed to relevant work, the techniques.
number of these features is low enough to make input data less Session Anomaly Detection (SAD) is a method developed by Cho
complex. and Cha [11] as a Bayesian estimation technique. In this model,
Finally, our work contributes to a better understanding of cur- web sessions are extracted from web logs and are labelled as ‘‘nor-
rent web security vulnerabilities. For example, we can detect mal” or ‘‘abnormal” depending on whether it is below or above the
web vulnerability scanners and learn about vulnerability itself assigned threshold value. In addition, two parameters that are page
at the same time. sequences and their frequency are investigated in training data. In
order to test their results; the authors use Whisker v1.4 as a tool
The rest of the paper is organized as follows: The related work is for generating anomalous web requests and it is asserted that
presented in Section 2. Section 3 presents our system model in The Bayesian estimation technique has been successful for detect-
details. Our model evaluation and real system test results are pre- ing 91% of all anomalous requests. Therefore, two points making
sented in Section 4. The concluding remarks are given in Section 5. this article different from the others are that SAD can be cus-
tomized by choosing site-dependent parameters; and the false
positive rates gets lower with web topology information.
2. Related work Singh et al. [12] have presented an analysis of two web-based
attacks which are i-frame injection attacks and buffer overflow
Within this section, the most related researches for vulnerabil- attacks. For analysis, log files created after attacks are used. They
ity scan detection have been reviewed. compare the size of the transferred data and the length of input
Auxilia and Tamilselvan suggest a negative security model for parameters for normal and malicious HTTP requests. As a result,
intrusion detections in web applications [7]. This method is one they just have carried out descriptive statistics and have not men-
of the dynamic detection techniques that is anomaly-based. The tioned any detection techniques.
authors propose to use Web Application Firewall (WAF) with a rule In their work, Stevanovic et al. [13] use SOM and Modified
set protecting web applications from unknown vulnerabilities. Adaptive Resonance Theory 2 (Modified ART2) algorithms for
When analyzed their rules for Hypertext Transfer Protocol (HTTP) training and 10 features related to web sessions for clustering.
attacks detection, the rules appears to be generated by checking Then, the authors label these sessions as human visitors, well-

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx 3

behaved web crawlers, malicious crawlers and unknown visitors. Standalone Edition (Jira 3.19.0-25-generic #26) is used as a web
In addition to classifying web sessions, similarities among the application. Access log configuration of Tomcat is set to be similar
browsing styles of Google, MSN, and Yahoo are also analyzed in to access log entries in Apache.
this article. The authors obtain lots of interesting results, one of
which is that 52% of malicious web crawlers and human visitors 3.2.2. Damn Vulnerable Web Application (DVWA)
are similar in their browsing strategies; which means that it is hard DVWA is a vulnerable PHP/MySQL web application. It is
to distinguish each other. designed to help web developers find out critical web application
Another completely different propose a semantic model that is security vulnerabilities by hands on activity. Different from illegal
named ontological model [14]. They assert that attack signatures website-hacking, it offers a totally legal environment to exploit for
are not independent from programming languages and platforms. security people. Thanks to DVWA; Brute Force, Cross Site Request
As a result, signatures may become invalid after some changes in Forgery (CSRF), Command Execution, XSS (reflected) and SQL Injec-
business logic. In contrary, their model is extendible and reusable tion vulnerabilities could be tested for three security levels; low,
and could detect malicious scripts in HTTP requests and response. medium, high.
Also, thanks to ontological model, zero day attacks could be effec- In this work, DVWA 1.0.8 version (Release date: 11/01/2011) is
tively detected. Their paper also includes a comparison between used. To install this web application, Linux Apache MySQL PHP
the proposed Semantic Model and ModSecurity. (LAMP) Server; including MySql, PHP5, and phpMyAdmin, has
There are several differences between our work and the above been installed. The reasons for studying with DVWA are to better
mentioned works. Firstly, as in the most of the related works, understand XSS and SQL Injection attacks and to find out related
checking only the user-agent header field from a list is not enough payloads substituted in query string part of URIs. In this way, rule
to detect web crawlers in the correct way. Correspondingly, we add selection to detect these attacks from access logs could be correctly
extra fields to check to make the web crawler detection more accu- determined. Also, web vulnerability scanners used in this work,
rate. Additionally, unlike machine learning and data-mining, have scanned this web application for data collection purposes.
rule-based detection has been used in the proposed model. Finally,
in contrast to other works, we prefer to use combined log format in 3.2.3. Web vulnerability scanners
order to make the number of features larger and to get more con- 3.2.3.1. Acunetix. Acunetix is one of the most commonly used com-
sistent results. mercial web vulnerability scanners. Acunetix scans a web site
according to the determined configurations, produces a report
3. System model about the existing vulnerabilities, groups them as high, medium,
low and informational; and identifies the threat level of the web
In this section, we describe how we construct and design the application with the related mitigation recommendations. In the
proposed model in detail. Also, we present our rules with underly- context of this work, Acunetix Web Vulnerability Scanner (WVS)
ing reasons. Reporter v7.0 has been used with default scanning configurations
in addition to site login information.
3.1. Assumptions
3.2.3.2. Netsparker. Netsparker is a web application security scan-
ner that is commercial too. Netsparker detects security vulnerabil-
In access logs, POST data can not get logged. Thus, the proposed
ities of a web application and produces a report including
method cannot capture this sort of data.
mitigation solutions. In addition, detected vulnerabilities could
Browsers or application servers may support other encodings.
be exploited to confirm the report results. In the context of this
Since only two of them are in the context of this work, our script
work, Netsparker Microsoft Software Library (MSL) Internal Build
cannot capture data encoded in other styles.
4.6.1.0 along with Vulnerability Database 2016.10.27.1533 has
Our model is designed for detection of two well-known web
been used with special scanning configurations including custom
application attacks and malicious web vulnerability scans, not
cookie information.
for prevention. Thus, working on-line mod is not included in
the context of our research.
3.2.3.3. Web Application Attack and Audit Framework (W3AF). W3AF
is an open source web application security scanner. W3AF is devel-
3.2. Data and log generation oped using Python and licensed under General Public License (GPL)
v2.0. Framework is designed to help web administrators secure the
In this section, tools, applications, virtual environment used web applications. W3AF could detect more than 200 vulnerabilities
throughout this work and their installation and configuration set- [17]. W3AF has several plug-ins for different operations such as
tings are explained. crawling, brute forcing, and firewall bypassing. W3AF comes by
default in Kali Linux and could be found in ‘‘Applications/Web
3.2.1. Web servers Application Analysis/Web Vulnerability Scanners”. W3AF version
3.2.1.1. HTTP Server. As mentioned earlier, Apache/2.4.7 (Ubuntu) 1.6.54 has been used with ‘‘fast-scan” profile through audit, crawl,
Server is chosen as a web server. Apache is known to be the most grep and output plugins.
commonly used web server. According to the W3Techs (Web Tech-
nology Surveys) [15], as of December 1, 2016; Apache is used by 3.3. Rules and methodology
51.2 percent of all web servers. In addition, it is open source, highly
scalable and has a dynamically loadable module system. Apache As mentioned earlier, our script runs on access log files. The
installation is made via apt-get command-line package manager. main reason for this choice is the opportunity for detailed analysis
Any extra configuration is not necessary for the scope of this work. about users actions. By examining past data, information security
policies for the web applications could be correctly created and
3.2.1.2. Apache Tomcat. The Apache Tomcat being an implementa- implemented. Additionally, further exploitations could be pre-
tion of the Java Servlet, JavaServer Pages, Java Expression Language vented in advance. Unlike the proposed model, Network Intrusion
and Java WebSocket technologies, is an open source software [16]. Detection System (NIDS) may not detect attacks when HTTPS is
In this work, Apache Tomcat Version 8.0.33 is used. Atlassian JIRA used [4]. However, working with logs has some disadvantages.

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
4 M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx

Since log files do not contain all data of HTTP request and response, extra module that directly accesses logs, or a script that analyses
some important data could not be analyzed. For example, POST logs faster could be developed to use our approach in a live or real
parameters that are vulnerable to injections attacks could not be environment.
logged by web servers. Another negative aspects are the size of logs Our method could be described as rule-based detection. Unlike
and parsing difficulty. Nevertheless, to solve this problem, we sep- anomaly based detection, our rules are static including both black-
arate the access log files on a daily basis. Therefore, web adminis- list and whitelist approaches. In detail, XSS and SQL injection
trators might run our script every day to check for an attack. Lastly, detection part of our method is a positive security model; on the
real-time detection and prevention is not possible with the pro- other hand, the rest is a negative security model. Thus, data eva-
posed method which runs off-line. Thus, we could not guarantee sion is tried to be kept at a minimum level. In order to classify IP
to run on-line. In fact, this approach is conceptually sufficient for addresses in the access log file, we identify three different visitor
the scope of this work. Differently from the test environment; an types as follows:

Fig. 1. Flow chart of the proposed rule-based model.

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx 5

Type 1: Regular (normal) users with a normal (natural) visit. As shown in Fig. 1 in Phase 1, our first step is to detect SQL
Type 2: Crawlers, bots, spiders or robots. injection and XSS attacks. Although different places of HTTP (the
Type 3: Malicious users using automated web vulnerability HTTP body, URI) could be used to exploit a vulnerability [4]; we
scanners. will analyze path and query parts of the requested URI for
detection.
In detail; for XSS, we use regular expressions to recognize some
Table 1
patterns such as HTML tags, ‘src’ parameter of the ‘img’ tag and
HTTP methods in Acunetix.
some Javascript event handlers. Likewise; for SQL injection, we
HTTP method Number check the existence of the singlequote, the doubledash, ‘#’, exec()
Connect 2
Get 2758
Options 2
Post 668 Table 6
Trace 2 Details of classified data sets.
Track 2
Visitor type Log file Line number IP number
Total 3434
Type 1 Normal 62,539 15
Type 2 Web robot 28,804 143
Type 3 Acunetix 6539 1
Type 3 Netsparker 7314 1
Table 2
Type 3 W3AF 3996 2
HTTP methods in Netsparker.
Type 1, 2 and 3 Total 109,192 162
HTTP method Number
Get 3059
Head 590
Netsparker 1 Table 7
Options 14 Confusion matrix.
Post 956
Propfind 14 Actual: Type 3 Actual: Type 1 or 2
Total 4634 Predicted: Type 3 TP = 3 FN = 1
Predicted: Type 1 or 2 FP = 0 TN = 158

Table 3 Table 8
HTTP status codes in Netsparker. Summary of results for general data set.
HTTP status code Number IP number Accuracy Precision Recall F1
200 177 162 99.38% 100.00% 75.00% 85.71%
301 1
302 23
404 494
500 6
1200
Total 701
Running Time (in seconds)

1000

Table 4 800
HTTP status codes in W3AF.

HTTP status code Number 600

200 91
302 8 400
404 30
500 6 200

Total 135
0
00

00
0

0
00

00
10

Table 5 Log Lines

HTTP status codes in Acunetix.

HTTP status code Number Fig. 2. Time performance of the proposed method.

200 598
301 38
302 686 Table 9
400 44 Details of log samples.
403 16
Log file Log duration File size Line number IP number
404 2022
405 4 Data Set 1 5 days 43 MB 202,145 3910
406 2 Data Set 2 210 days 13.4 MB 34,487 9269
417 2 Data Set 3 270 days 7.2 MB 36,310 4719
500 20 Data Set 4 90 days 1.3 MB 5936 1795
501 2 Data Set 5 90 days 0.48 MB 3554 579
Total 3434 Total 665 days 65.37 MB 282,432 20,272

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
6 M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx

Table 10
Data sets test results.

Data set Period IP number Type 3 IP number Type 3 percentage (%)

Data Set 1 2004 10/March 370 13 3.51
11/March 786 20 2.54
12/March 1002 22 2.20
13/March 1960 39 1.99
14/March 1079 21 1.95
Data Set 2 2004 April 3140 1 0.03
May 4546 3 0.07
June 701 6 0.86
July 735 4 0.54
August 189 1 0.53
September 280 0 0.00
October 106 1 0.94
Data Set 3 2005 June 663 1 0.15
July 755 1 0.13
August 577 0 0.00
September 731 1 0.14
October 452 0 0.00
November 623 19 3.05
December 181 1 0.55
January 652 45 6.90
February 802 34 4.24
Data Set 4 2005 1–15/June 160 1 0.63
16–30/June 497 0 0.00
1–15/July 503 0 0.00
16–30/July 280 1 0.36
1–15/August 284 0 0.00
16–30/August 282 0 0.00
Data Set 5 2005 16–31/January 28 0 0.00
1–15/February 176 0 0.00
16–28/February 112 0 0.00
1–15/March 225 3 1.33
16–30/March 28 0 0.00

function and some SQL keywords. In addition, since there is a pos- 3. Web robots have higher unassigned referrer (‘‘–”) rates [23–25].
sibility for URL obfuscation, Hex and UTF-8 encodings of these pat- 4. According to the access logs that we analyzed, user-agent
terns are also taken in consideration. header field of web robots may contain some keywords such
Afterwards, we continue by separating IP addresses of Type 2 as bot, crawler, spider, wanderer, and robot.
from the rest of the access log file in Phase 2. To do this, two differ-
ent approaches are used. Firstly, user-agent part of all log entries is As a result of above mentioned observations, we add some extra
compared with the user-agent list from robots database that is rules to correctly distinguish Type 2 from other visitors.
publicly available in [18]. However, since this list may not be up- For the rest our rule set as indicated at Phase 3 in Fig. 1, we con-
to-date, another bot detection rules are added. In order to identify tinue by investigating our access log files formed as a result of vul-
these rules, we use the following observations about web robots: nerability scanning mentioned in the previous section. As shown in
Tables 1 and 2, our first immediate observation is that as compared
1. Most of the web robots make a request for ‘‘/robots.txt” file [19]. to Type 2 and Type 1, Type 3’s requests include different HTTP
2. Web robots have higher rate of ‘‘4xx” requests since they usu- methods; such as Track, Trace, Netsparker, Pri, Propfind and Quit.
ally request unavailable pages [20–23]. Secondly, as shown in Table 3, Tables 4 and 5; we deduct that

4 1

3.5 0.9

3 0.8

0.7
2.5
Type 3 (%)

Type 3 (%)

0.6
2
0.5
1.5
0.4
1
0.3

0.5 0.2

0 0.1
04 04 04 04 04
ar. ar. ar. ar. ar. 0
.M .M .M .M .M
10 11 12 13 14 Apr May Jun Jul Aug Sep Oct
Days Months

Fig. 3. Data Set 1 test results. Fig. 4. Data Set 2 test results.

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx 7

status codes of Type 3 differ from Type 2 and Type 1. In fact, Type 3 The pseudo code of the proposed model is shown in Algorithm
has higher rate of ‘‘404” requests, average of which for Acunetix, 1:
Netsparker and W3AF is 31% in our data set. Thus, we generate a
rule to check the presence of these HTTP methods and the percent-
age of ‘‘404” requests. User-agent header fields of Type 3 could Algorithm 1. Pseudo-Code for Proposed Model.

∈
←−
∈

←−
∈

←−
←−
←−
←−
” ”
∈ ” ” ” ” ” ” ” ” ” ”
” ” ∈

←−
∈

←−
←−
←−
∈
” ” ” ” ” ” ” ” ” ” ” ”

generally be modified and obfuscated manually at the configura- 4. Results

tion phase before vulnerability scan. Even so, we made a list of
well-known automated web vulnerability scanners, and compare This section is based on the evaluation of our model against
it with user-agent header fields. Finally, we notice that these scan- some important metrics. Moreover, test results of attack detection
ners make at least more than 100 HTTP requests in a certain time, on live data are also included.
we select this value as a threshold for Type 3 detection.

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
8 M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx

4.1. Experimental setup Fig. 2 illustrates the relation between the line number of the log
files and the running time. It is clear that the running time rises
To implement our rules, Python programming language version steadily as the number of the lines increases.
3.5 has been chosen. Script is executed on Ubuntu operating sys-
tem mentioned in Section 3.2.2 via terminal. To parse log lines,
‘‘apache-log-parser 1.7.0” which is a Python package has been 4.3. Scan detection on live data
used. As well as, we benefit from python libraries that are collec-
tions, datetime, numpy, ua-parser and argparse. We have built or model according to the data sets mentioned in
Since there are not any actual, publicly available and labelled Section 4.1. Additionally, we test our model according to several
data sets to evaluate our model, we create our data sets. In fact, large-scale, live, not labelled and publicly available data sets. In
we deploy two different web applications on two different web this section, we share our test results illustrated in tables and
servers to form Type 1 and Type 3 traffics. Details are expressed graphs.
in Section 3.2.2. In accordance with this purpose, we have used log samples
Type 1 (normal traffic) is the data set collected from Jira Soft- from real systems [26]. As stated in the related web source, these
ware as a web application running on Tomcat web server during samples are collected from various systems, security devices,
4 days.The size of the related access log file is 16.3 MB. As shown applications, etc.; and neither Chuvakin nor we did not sanitize,
Table 6, log file contains 62,539 log entries from 15 different IP anonymized or modified them in any way. Since they include HTTP
addresses. These requests are generated in a local network. access logs, we have chosen the log samples named Bundle 9, Bun-
For Type 2 traffic, an external traffic that is open to the internet dle 7, Bundle 1, Bundle 4 and Bundle 3. For the rest of the work,
is needed. To this end, we make use of three different access log these bundles are expressed as Data Set 1, Data Set 2, Data Set 3,
files retrieved from a company website. In detail, log files contain Data Set 4 and Data Set 5 respectively. Details of these data sets
crawling data collected during 13 days from requests of several are shown in Table 9.
web robots. The size of the related access log files is totally In order to test the log samples, Data Set 1, Data Set 2, Data Set
6.4 MB, and log files contain 28,804 log entries from 143 different 3, Data Set 4 and Data Set 5 are divided into daily, monthly,
IP addresses as shown Table 6. monthly, 15-day and 15-day periods respectively. Related details
To generate Type 3 traffic, DVWA running on Apache HTTP Ser- are expressed in Table 10.
ver is used as a web application. Before scanning, the security level Type 3 percentage of each data set is shown in Figs. 3–7.
of DVWA is configured as low security. Moreover, we scan this
application via Acunetix, Netsparker and W3AF as web vulnerabil-
ity scanners. Firstly, DVWA is scanned for 22 min and 22 s with 7
Acunetix. Secondly, DVWA is scanned for 19 min and 56 s with
Netsparker. Lastly, DVWA is scanned for 2 min and 6 s with 6
W3AF. The details of the related access log files are summarized
as Type 3 in Table 6. 5

For the evaluation of the proposed model, we combine all men-

Type 3 (%)

tioned access log files into one file that is our general data set. 4

Then, we run our Python script on the mentioned data set.

2
4.2. Model evaluation

1
Initially, to evaluate the proposed model, we compute the con-
fusion matrix where TP, FN, FP, and TN denote true negatives, false
0
negatives, false positives, and true negatives respectively as shown Jun Jul Aug Sep Oct Nov Dec Jan Feb
in Table 7. Months
After, we evaluate the following measures:
Fig. 5. Data Set 3 test results.

ðTN þ TPÞ
accuracyðaccÞ ¼
ðTN þ FN þ FP þ TPÞ
0.7
ðTPÞ
precisionðprecÞ ¼
ðTP þ FPÞ 0.6
ð1Þ
ðTPÞ
recallðrecÞ ¼
ðTP þ FNÞ 0.5

ð2TPÞ
F1 score ¼
Type 3 (%)

0.4
ð2TP þ FP þ FNÞ
0.3
More specifically, the accuracy provides the percentage of Type 3
that are detected correctly. The precision determines the fraction 0.2
of IP addresses correctly classified as Type 3 over all IP addresses
classified as Type 3. The recall (a.k.a. sensitivity) is the fraction of 0.1
IP addresses correctly classified as Type 3 over all IP addresses of
Type 3. Finally, the F1 -score is a harmonic mean of precision and 0
1 2 3 4 5 6
recall. As a result, our model has 99.38% accuracy, 100.00% preci-
15-Day Period
sion, 75.00% recall and finally 85.71% F1 score as we can see in
Table 8. Fig. 6. Data Set 4 test results.

Please cite this article in press as: M. Basß Seyyar et al., Applied Computing and Informatics (2017), http://dx.doi.org/10.1016/j.aci.2017.04.002
M. Basß Seyyar et al. / Applied Computing and Informatics xxx (2017) xxx–xxx 9

1.4 [2] European Union Agency for Network and Information Security (ENISA), ENISA
Threat Landscape 2015. URL <https://www.enisa.europa.eu/publications/
etl2015>, 2016 (accessed November 29, 2016).
1.2
[3] D.V. Bernardo, Clear and present danger: interventive and retaliatory
approaches to cyber threats, Appl. Comput. Infor. 11 (2) (2015) 144–157,
1 http://dx.doi.org/10.1016/j.aci.2014.11.002, URL <http://
www.sciencedirect.com/science/article/pii/S2210832714000386>.
[4] R. Meyer, Detecting Attacks on Web Applications from Log Files. URL <https://
Type 3 (%)

0.8
www.sans.org/reading-room/whitepapers/logging/detecting-attacks-web-
applications-log-files-2074>, 2008 (accessed December 12, 2016).
0.6 [5] D.B. Cid, Log Analysis using OSSEC. URL <http://www.academia.edu/8343225/
Log_Analysis_using_OSSEC>, 2007 (accessed November 29, 2016).
[6] Wikipedia, Overfitting. URL <https://en.wikipedia.org/wiki/Overfitting>, 2016
0.4
(accessed December 27, 2016).
[7] M. Auxilia, D. Tamilselvan, Anomaly detection using negative security model in
0.2 web application, in: 2010 International Conference on Computer Information
Systems and Industrial Management Applications (CISIM), 2010, pp. 481–486,
http://dx.doi.org/10.1109/CISIM.2010.5643461.
0
1 2 3 4 5 [8] K. Goseva-Popstojanova, G. Anastasovski, R. Pantev, Classification of malicious
web sessions, in: 2012 21st International Conference on Computer
15-Day Period
Communications and Networks (ICCCN), 2012, pp. 1–9, http://dx.doi.org/
10.1109/ICCCN.2012.6289291.
Fig. 7. Data Set 5 test results.
[9] M. Husák, P. Velan, J. Vykopal, Security monitoring of http traffic using
extended flows, in: 2015 10th International Conference on Availability,
Reliability and Security, 2015, pp. 258–265, http://dx.doi.org/10.1109/
5. Conclusion ARES.2015.42.
[10] M. Zolotukhin, T. Hämäläinen, T. Kokkonen, J. Siltanen, Analysis of http
requests for anomaly detection of web attacks, in: 2014 IEEE 12th
In this work, we studied web vulnerability scans detection International Conference on Dependable, Autonomic and Secure Computing,
through access log files of web servers in addition to detection of 2014, pp. 406–411, http://dx.doi.org/10.1109/DASC.2014.79.
[11] S. Cho, S. Cha, Sad: web session anomaly detection based on parameter
XSS and SQLI attacks. In accordance with this purpose, we used
estimation, Comput. Secur. 23 (4) (2004) 312–319, http://dx.doi.org/10.1016/
rule-based methodology. Firstly, we examined the behavior of j.cose.2004.01.006, URL <http://www.sciencedirect.com/science/article/pii/
the automated vulnerability scanners. Moreover, we implemented S0167404804000264>.
our model with a Python script. Afterwards, our model has been [12] N. Singh, A. Jain, R.S. Raw, R. Raman, Detection of Web-Based Attacks by
Analyzing Web Server Log Files, Springer India, New Delhi, 2014, http://dx.doi.
evaluated based on data we have collected. Finally, we tested our org/10.1007/978-81-322-1665-0_10, pp. 101–109.
model on the log samples from real systems. [13] D. Stevanovic, N. Vlajic, A. An, Detection of malicious and non-malicious
It is clear that our method has very high probability of detection website visitors using unsupervised neural network learning, Appl. Soft
Comput. 13 (1) (2013) 698–708, http://dx.doi.org/10.1016/j.
and low probability of false alarm. More specifically, the accuracy asoc.2012.08.028, URL <http://www.sciencedirect.com/science/article/pii/
and the precision rates of our model are 99.38%, 100.00% respec- S1568494612003778>.
tively. More importantly, malicious scans can be captured more [14] A. Razzaq, Z. Anwar, H.F. Ahmad, K. Latif, F. Munir, Ontology for attack
detection: an intelligent approach to web application security, Comput. Secur.
precisely because different types of scanning tools including both 45 (2014) 124–146, http://dx.doi.org/10.1016/j.cose.2014.05.005, URL <http://
open source and commercial tools were examined. Therefore, our www.sciencedirect.com/science/article/pii/S0167404814000868>.
results indicates that static rules can detect successfully web vul- [15] W3Techs (Q-Success DI Gelbmann GmbH), Usage Statistics and Market Share
of Apache for Websites. URL <https://w3techs.com/technologies/details/ws-
nerability scans. Besides, we have observed that our model func-
apache/all/all>, 2009–2017 (accessed December 12, 2016).
tions properly with larger and live data sets and correctly detects [16] The Apache Software Foundation, Apache Tomcat. URL <http://tomcat.apache.
Type 3 IP addresses. org> (accessed December 24, 2016).
[17] w3af.org, w3af. URL <http://w3af.org>, 2013 (accessed December 12, 2016).
As shown in the Fig. 2, the relation between the number of lines
[18] The Web Robots Pages, Robots Database. URL <http://www.robotstxt.org/db.
of the log files and the running time is linear. As a result, how long html> (accessed September 4, 2016).
a log file would be analyzed, could be predicted in advance. [19] M.C. Calzarossa, L. Massari, D. Tessera, An extensive study of web robots traffic,
The results presented in this work may enhance researches in: Proceedings of International Conference on Information Integration and
Web-based Applications & Services, IIWAS ’13, ACM, New York, NY, USA,
about malicious web scans and may support the development of 2013, pp. 410:410–410:417, http://dx.doi.org/10.1145/2539150.2539161.
attack detection studies. Also, if security analysts or administrators [20] M.D. Dikaiakos, A. Stassopoulou, L. Papageorgiou, An investigation of web
execute the proposed python script several times within the same crawler behavior: characterization and metrics, Comput. Commun. 28 (8)
(2005) 880–897, http://dx.doi.org/10.1016/j.comcom.2005.01.003, URL
day, he/she could prevent most of the web related attacks. <http://www.sciencedirect.com/science/article/pii/S0140366405000071>.
Future work considerations related to this work are twofold. In [21] M. Dikaiakos, A. Stassopoulou, L. Papageorgiou, Characterizing Crawler
the first place, one could make our model possible to analyze other Behavior from Web Server Access Logs, Springer Berlin Heidelberg, Berlin,
Heidelberg, 2003, http://dx.doi.org/10.1007/978-3-540-45229-4_36.
log files such as audit log and error log. Secondly, in addition to the [22] M.C. Calzarossa, L. Massari, Analysis of Web Logs: Challenges and Findings,
scope of this work; different from SQLI and XSS attacks, other well- Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, http://dx.doi.org/
known web application attacks like CSRF could be addressed too. 10.1007/978-3-642-25575-5_19, pp. 227–239.
[23] D. Stevanovic, A. An, N. Vlajic, Feature evaluation for web crawler detection
with data mining techniques, Expert Syst. Appl. 39 (10) (2012) 8707–8717,
Appendix A. Supplementary material http://dx.doi.org/10.1016/j.eswa.2012.01.210, URL <http://
www.sciencedirect.com/science/article/pii/S0957417412002382>.
[24] A.G. Lourenço, O.O. Belo, Catching web crawlers in the act, in: Proceedings of
Supplementary data associated with this article can be found, in the 6th International Conference on Web Engineering, ICWE ’06, ACM, New
the online version, at http://dx.doi.org/10.1016/j.aci.2017.04.002. York, NY, USA, 2006, pp. 265–272, http://dx.doi.org/10.1145/
1145581.1145634.
[25] D. Stevanovic, A. An, N. Vlajic, Detecting Web Crawlers from Web Server
References Access Logs with Data Mining Classifiers, Springer Berlin Heidelberg, Berlin,
Heidelberg, 2011, http://dx.doi.org/10.1007/978-3-642-21916-0_52, pp. 483–
[1] E.M. Hutchins, M.J. Cloppert, R.M. Amin, Intelligence-driven Computer 489.
Network Defense Informed by Analysis of Adversary Campaigns and [26] A. Chuvakin, Public Security Log Sharing Site. URL <http://log-
Intrusion Kill Chains, vol. 1, API, 2011, URL <https://books.google.com. sharing.dreamhosters.com>, 2009 (accessed December 15, 2015).
tr/books?id=oukNfumrXpcC>.