Netflow-based Malware Detection and Data Visualisation System
Rafal Kozik, Robert Mlodzikowski, Michal Choraś
1 Introduction
Nowadays, one of the major cybersecurity challenges is countering malicious
software [1]. Malware samples are usually carefully crafted programs that aim
to stay dormant while performing detailed surveillance of the infected
infrastructure and assets. Infected computers commonly connect to one another
over the telecommunication network and form a so-called botnet, which
cybercriminals can easily control centrally for different malicious purposes
such as DDoS attacks, spam distribution, sensitive data theft, extortion
attacks, etc.
Different solutions can be used to combat such cyber threats. However,
commonly used anti-virus software may not be efficient enough to protect
the network. An example is the attack on the Polish financial sector that
happened in 2017 [2]. During that attack, the largest system hack in the
country's history took place and several banks in Poland were infected with
malware. This particular malware was a new strain of malicious software which
had never been seen before in live attacks and it had a zero detection rate on
VirusTotal.
2 Related Work
The attack signatures (in the form of reactive rules) for software such as
Snort [4] are commonly provided by experts from the cyber community. For
deterministic attacks, it is typically fairly easy to develop patterns that
clearly identify a particular attack. This is often the case when a given piece
of malicious software (e.g. a worm) uses the same protocol and algorithm to
communicate through the network with a command and control centre or with
another instance of such software. However, the task of developing new
signatures becomes more complicated when it comes to polymorphic worms or
viruses. Such software commonly modifies and obfuscates its code (without
changing the internal algorithms) in order to be less predictable and harder
to detect.
The development of an efficient and scalable method for malware detection is
currently challenging also due to the general unavailability of raw network
data. This unavailability, which stems from users' privacy as well as from
administrative and legal reasons, causes additional difficulties for research
and development [5, 6].
Currently, the common alternative is so-called NetFlow [7] data, which is
often captured by ISPs for auditing and performance monitoring purposes. Since
NetFlow samples do not contain any sensitive data, they are widely available.
Their disadvantage, however, is that they lack the raw content of the network
packets.
In the literature, there are different approaches focusing on the analysis
of NetFlow data. In [8, 9] the authors focused on computational paradigms (e.g.
MapReduce) for NetFlow data analysis and malware detection. On the other
hand, in [10, 11] the authors proposed statistical techniques for feature
extraction from groups of network flows.
Netflow-based Malware Detection and Data Visualisation System 3
The BClus [12] method uses a behavioural approach for botnet detection. It
aggregates NetFlows for specific IP addresses and clusters them according to
statistical characteristics. The properties of the clusters are described and
used for further botnet detection. Another approach is used in the BotHunter
[13] tool. It monitors the two-way communication flows between hosts within the
internal network and the Internet. BotHunter employs the Snort intrusion
detection system. It models an infection sequence as a composition of
participants and a loosely ordered sequence of network information exchanges.
[Fig. 1. System overview diagram. Blocks: Monitored Local Area Network, Router,
Packets Processing (Pre-processing, NetFlows Extraction), Data Storage, GUI.]
Fig. 1 presents the general overview of the system design. The collected raw
data is processed in order to extract the NetFlows. NetFlow is a standardised
format for describing bidirectional communication and contains such
information as the source and destination IP addresses, the destination port,
the number of bytes exchanged, etc. The extracted NetFlows are stored in the
database for further processing. Consequently, the data mining and feature
extraction methods currently work in batch processing mode. However, in the
future, we plan to allow the system to directly analyse streams of data
containing the raw NetFlows.
A single NetFlow usually does not provide enough evidence to decide whether
a particular machine is infected or whether a particular request has malicious
symptoms. Therefore, it is quite common [12] to aggregate NetFlows in
so-called time windows, so that more contextual data can be extracted and
malicious behaviour recorded (e.g. port scanning, packet flooding effects,
etc.). To do that, different statistics can be extracted for each time window.
In the current version of the proposed system, we have implemented two
different methods for pattern extraction (the Feature Extraction block on the
diagram). These methods are described in the following subsections. In general,
they produce the feature vectors that are then used to train different ML
algorithms (the Data Mining and Machine Learning block on the diagram).
The machine learning algorithms are available via the Weka [3] library.
The system is also equipped with a graphical user interface (indicated as GUI
on the diagram), which allows the network administrator to visualise different
statistical properties of the analysed traffic (e.g. the amount of data
generated by specific IP addresses or the most active ones) as well as the
classification results. Some aspects of the visualisation process are
described in a separate section.
In order to evaluate the effectiveness of different algorithms, we have used
the CTU-13 dataset. It contains different scenarios representing different
infections and malware communication schemes with command and control.
Therefore, in this paper, we do not consider the problem of constructing
a realistic testbed.
3.1 Method 1
The first feature extraction method aggregates NetFlows in a time window (in
this approach we use 1-minute time windows). One reason for the aggregation is
to identify the context needed to capture relevant behaviours of different
hosts. For each time window the following statistics are calculated:
– number of NetFlows
– total sum of transferred bytes
– average sum of transferred bytes per NetFlow
– number of unique destination IP addresses
One of the advantages of this approach is that the number of feature
vectors is equal to the number of time windows. Therefore, for short scenarios
the size of the resulting dataset will be small and thus the machine learning
process will be faster.
However, one obvious drawback of this approach is that it is impossible to
identify the IP address of the infected machine, because the system will only
signal that a particular time window should be considered anomalous.
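The four per-window statistics of Method 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the system: the function name `method1_features` and the dictionary keys `start` (flow start time in seconds), `dst_addr` and `tot_bytes` are assumptions introduced here.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute time windows, as stated in the text


def method1_features(flows, window=WINDOW_SECONDS):
    """Aggregate flows into one feature vector per time window.

    `flows` is an iterable of dicts with the hypothetical keys
    'start' (seconds), 'dst_addr' and 'tot_bytes'.
    """
    # Assign every flow to the window that contains its start time.
    windows = defaultdict(list)
    for flow in flows:
        windows[int(flow["start"] // window)].append(flow)

    features = {}
    for index, group in sorted(windows.items()):
        total_bytes = sum(f["tot_bytes"] for f in group)
        features[index] = {
            "n_flows": len(group),
            "total_bytes": total_bytes,
            "avg_bytes_per_flow": total_bytes / len(group),
            "n_unique_dst": len({f["dst_addr"] for f in group}),
        }
    return features
```

The result contains exactly one entry per non-empty window, which illustrates both the small dataset size and the fact that no host address survives the aggregation.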
3.2 Method 2
The second feature extraction method, similarly to the previous one, aggregates
NetFlows in time windows. However, within each time window, we additionally
group the NetFlows by source IP address. For each group (time window, source
IP address) we calculate the following statistics:
– number of flows
– sum of transferred bytes
– average sum of bytes per NetFlow
– average communication time between unique IPs
– number of unique IP addresses
– number of unique destination ports
– most frequently used protocol (e.g. TCP, UDP) by specific IP source address
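A possible sketch of the grouping used by Method 2 is given below. It is an illustration under stated assumptions rather than the system's actual code: the function name and the dictionary keys (`start`, `src_addr`, `dst_addr`, `dport`, `proto`, `dur`, `tot_bytes`) are hypothetical, the unique IP addresses are assumed to be destination addresses, and the "average communication time between unique IPs" is interpreted here as the total flow duration divided by the number of unique destinations, which is only one possible reading of that statistic.

```python
from collections import Counter, defaultdict


def method2_features(flows, window=60):
    """Build one feature vector per (time window, source IP) group."""
    groups = defaultdict(list)
    for flow in flows:
        key = (int(flow["start"] // window), flow["src_addr"])
        groups[key].append(flow)

    features = {}
    for key, group in groups.items():
        total_bytes = sum(f["tot_bytes"] for f in group)
        destinations = {f["dst_addr"] for f in group}
        protocols = Counter(f["proto"] for f in group)
        features[key] = {
            "n_flows": len(group),
            "total_bytes": total_bytes,
            "avg_bytes_per_flow": total_bytes / len(group),
            # Assumed interpretation: total duration per unique destination.
            "avg_time_per_dst": sum(f["dur"] for f in group) / len(destinations),
            "n_unique_dst": len(destinations),
            "n_unique_dports": len({f["dport"] for f in group}),
            "top_proto": protocols.most_common(1)[0][0],
        }
    return features
```

Because the source address is part of the key, an anomalous feature vector can be traced back to a specific host, which is not possible with the per-window vectors of the first method.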
4 Data Visualisation
Fig. 2. An example of the proposed system visualisation capabilities for the selected
scenario. The figure presents the amount of traffic generated on average by all hosts
(green line), by the most active hosts (orange line), and the infected host (red line).
Fig. 3. The figure presents the number of connections established by different
hosts for the analysed scenario. It can be noticed that the infected host
establishes a suspiciously high number of connections.
5 Experiments
5.1 Validation Methodology
For evaluation purposes, we have adopted the stratified 10-fold
cross-validation methodology. The method was used to assess the True Positives
Ratio (TPR) and the False Positives Ratio (FPR).
True Positives Ratio (TPR) is defined as the number of samples (feature
vectors) identified correctly as infected (True Positives - TP) divided by the
number of all samples that are infected (True Positives + False Negatives).
$$TPR = \frac{TP}{TP + FN} \tag{1}$$
False Positives Ratio (FPR) is defined as the number of samples identified
wrongly as infected (False Positives - FP) divided by the number of all clean
samples (True Negatives + False Positives).
$$FPR = \frac{FP}{TN + FP} \tag{2}$$
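The two ratios can be computed directly from the true and predicted labels. The sketch below is illustrative only; the function name `rates` is hypothetical, and it assumes that the label "Botnet" marks an infected sample while every other label counts as clean.

```python
def rates(y_true, y_pred, positive="Botnet"):
    """Return (TPR, FPR) following equations (1) and (2)."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (tn + fp) if (tn + fp) else 0.0
    return tpr, fpr
```

For example, with three infected samples of which two are detected, and two clean samples of which one is flagged, the function yields TPR = 2/3 and FPR = 1/2.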
The procedure for effectiveness evaluation is as follows: the algorithms are
trained and evaluated 10 times and the obtained results are averaged.
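The stratified 10-fold procedure can be sketched as follows. This is a minimal standard-library illustration, not the Weka routine used in the experiments; the helper names `stratified_folds` and `cross_validate`, as well as the `evaluate(train, test)` callback that returns a (TPR, FPR) pair, are assumptions introduced for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def cross_validate(labels, evaluate, k=10, seed=0):
    """Average the (TPR, FPR) pairs returned by `evaluate` over k folds."""
    folds = stratified_folds(labels, k, seed)
    tprs, fprs = [], []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        tpr, fpr = evaluate(train, test)
        tprs.append(tpr)
        fprs.append(fpr)
    return sum(tprs) / k, sum(fprs) / k
```

Each fold serves once as the test set while the remaining nine are used for training, so every sample is evaluated exactly once.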
– Dur - Duration,
– Proto - IP protocol (e.g. UDP, TCP),
– SrcAddr - Source address,
– Sport - Source port,
– Dir - Direction of the recorded communication,
– DstAddr - Destination Address,
– Dport - Destination Port,
– State - Protocol state,
– sTos - Source type of service,
– dTos - Destination type of service,
– TotPkts - Total number of packets that have been exchanged between source
and destination,
– TotBytes - Total bytes exchanged,
– SrcBytes - Number of bytes sent by source,
– Label - Label assigned to this NetFlow (e.g. Background, Normal,
Botnet)
It must be noted that the "Label" field is an additional attribute provided
by the authors of the dataset. Normally, a NetFlow will have 14 attributes and
the "Label" will be assigned by the classifier.
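To make the attribute list concrete, the sketch below reads such records into a small Python structure. It assumes, purely for illustration, that the flows are available as comma-separated text with a header row that uses the attribute names listed above; the `FlowRecord` class and the `read_flows` function are hypothetical, and only a subset of the attributes is converted.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class FlowRecord:
    dur: float
    proto: str
    src_addr: str
    dst_addr: str
    dport: str      # kept as text, since the port may be empty for some protocols
    tot_pkts: int
    tot_bytes: int
    src_bytes: int
    label: str


def read_flows(text):
    """Parse comma-separated flow records; columns not listed are ignored."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        FlowRecord(
            dur=float(row["Dur"]),
            proto=row["Proto"],
            src_addr=row["SrcAddr"],
            dst_addr=row["DstAddr"],
            dport=row["Dport"],
            tot_pkts=int(row["TotPkts"]),
            tot_bytes=int(row["TotBytes"]),
            src_bytes=int(row["SrcBytes"]),
            label=row["Label"],
        )
        for row in reader
    ]
```

When a classifier is deployed, the `label` field would be absent from the input and filled in by the prediction instead.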
5.3 Results
The proposed methods have been evaluated on the scenario concerning the Rbot
malware. According to the scenario description, the malware carries out an
ICMP DDoS attack.
The TPR and FPR values are presented in Table 1. The second feature extraction
method achieved better results: the average effectiveness of botnet detection
over all the classifiers is 47.0% for the first method and 63.0% for the
second method. In addition, the classifiers combined with the first method for
pattern extraction yielded high FP ratios.
The conclusion from this experiment is that the second feature extraction
method combined with RandomForest (or RandomCommittee) allowed us to
achieve a malware detection rate of 66.7% with no false positives.
6 Conclusions
In this paper, we have presented preliminary results of the proposed malware
detection method. Our approach relies on the analysis of malware network
activity that is captured by means of NetFlows. We have presented the
architecture of the proposed system. The current implementation includes two
methods for pattern extraction that analyse the NetFlows in disjoint time
windows. The extracted feature vectors have been used to train different
machine learning algorithms. The methods have been evaluated on a publicly
available dataset. Future work will be dedicated to evaluating the scalability
of the proposed methods and to further improvements towards online machine
learning.
References
1. Choras Michal, Kozik R., Renk R., Holubowicz W., A Practical Framework and
Guidelines to Enhance Cyber Security and Privacy , in: Herrero A., Baruque
B., Sedano J., Quintan H., Corchado E. (Eds), International Joint Conference
CISIS’15 and ICEUTE’15, Advances in Intelligent Systems and Computing, 485-
496, ISBN 978-3-319-19712-8, Springer 2015.
2. The Hacker News web page. Polish Banks Hacked using Malware Planted
on their own Government Site: http://thehackernews.com/2017/02/
bank-hacking-malware.html.
3. WEKA Data Mining Software. URL: http://www.cs.waikato.ac.nz/ml/weka/.
4. SNORT. Project homepage. http://www.snort.org/
5. Andrysiak T., Saganowski L., Choras Michal, Kozik R., Network Traffic Prediction
and Anomaly Detection Based on ARFIMA Model , in Jos Gaviria de la Puerta et
al., Advances in Intelligent Systems and Computing, vol. 229, 545-554, Springer,
2014.
6. Choras Michal, Kozik R., Puchalski D., Holubowicz W., Correlation Approach
for SQL Injection Attacks Detection, In: Herrero A. et al (Eds.), Advances in
Intelligent and Soft Computing, 189, 177-186, Springer, 2012.
7. Claise, B., Cisco Systems NetFlow Services Export Version 9. RFC 3954 (Infor-
mational), 2004
8. J. Francis, S. Wang, R. State, and T. Engel, Bottrack: Tracking botnets using
netflow and pagerank, in Proceedings of IFIP/TC6 Networking, 2011.
9. J. Dean and S. Ghemawat, MapReduce: simplified data processing on large
clusters, in Symposium on Operating Systems Design and Implementation (OSDI).
USENIX Association, 2004.
10. Lakhina A, Crovella M, Diot C. Diagnosing network-wide traffic anomalies. ACM
SIGCOMM Computer Communication Review; 2004;34:357-374
11. Lakhina A, Crovella M, Diot C. Mining anomalies using traffic feature distribu-
tions. ACM SIGCOMM Computer Communication Review; 2005;35:217-228.
12. S. Garcia, M. Grill, J. Stiborek, A. Zunino, An Empirical Comparison of Botnet
Detection Methods, Computers & Security, vol. 45, pp. 100–123, 2014
13. BotHunter homepage, url: http://www.bothunter.net/about.html
14. Garcia S., Grill M., Stiborek H. and Zunino A. An empirical comparison of botnet
detection methods. Computers and Security Journal, Elsevier. 2014. Vol 45, pp
100–123