2021 SECRYPT Granef Utilization of A Graph Database For Network Forensics Paper - Archive

GRANEF: Utilization of a Graph Database for Network Forensics

Milan Cermak (a) and Denisa Sramkova (b)

Institute of Computer Science, Masaryk University, Brno, Czech Republic

cermak@ics.muni.cz, [email protected]

Keywords: Network Forensics, Graph Database, Dgraph, Zeek, Association-based Analysis

Abstract: Understanding the information in captured network traffic, extracting the necessary data, and performing incident investigations are principal tasks of network forensics. The analysis of such data is typically performed by tools allowing manual browsing, filtering, and aggregation or tools based on statistical analyses and visualizations facilitating data comprehension. However, the human brain is used to perceiving the data in associations, which these tools can provide only in a limited form. We introduce a GRANEF toolkit that demonstrates a new approach to exploratory network data analysis based on associations stored in a graph database. In this article, we describe data transformation principles, utilization of a scalable graph database, and data analysis techniques. We then discuss and evaluate our proposed approach using a realistic dataset. Although we are at the beginning of our research, the current results show the great potential of association-based analysis.

1 INTRODUCTION

Network forensics covers a variety of techniques used for cyber-attack investigation, information gathering, and legal evidence using identification, capture, and analysis of network traffic (Khan et al., 2016). The crucial part is the analysis of collected data (e.g., packet data or IP flows) to filter and extract the required information and gain a situational overview. Such analysis can be partly automated using anomaly or intrusion detection tools (Fernandes et al., 2018). However, these tools may not reveal details important to evidence collection, and therefore manual exploratory network data analysis plays an important role, as it allows analysts to verify detected anomalies, examine contexts, or extract additional information.

One of the main challenges of the exploratory analysis of network traffic is the volume of data that faces high computational demands. Besides, forensic analysis requires that the analyst has access to all the data, which limits the use of some automated tools aggregating the data. Such analysis is typically based on two approaches: interactive raw data analysis and statistical analysis. Tools such as Wireshark or Network Miner are commonly used in interactive raw data analysis to filter, aggregate, and extract meaningful information. Their disadvantage is a limited visualization, amount of obtained information, high demands on computing resources, and limited automation of analysis queries. In the statistical approach, the significant packet elements are extracted from network traffic and visualized in the form of various statistics charts and overview visualizations. The main advantage of tools such as Arkime or Elastic Stack is processing large amounts of network data and providing an overview via interactive visualizations. Nevertheless, because of the data aggregation, the analyst has limited access to raw data.

Our research aims to combine the advantages of both approaches and enable the analyst to investigate the captured data using interactive visualization. To achieve this goal, we introduce the GRANEF toolkit focused on association-based network traffic analysis. This method is widely used to analyze real-world objects, social networks, or as part of criminal investigation (Atkin, 2011). It also reflects the way people naturally think (Zhang et al., 2020). In contrast to current methods focused only on hosts relations, we focus on an exploratory analysis of all significant attributes of collected network traffic data, including connection properties and application data. The toolkit is based on graph database Dgraph (Dgraph Labs, Inc., 2021) capable of storing and analyzing a large volume of logs provided by the Zeek (The Zeek Project, 2020) network security monitor. Unlike interactive raw data analysis, our approach allows the analysts to browse, filter, and aggregate all collected information and visualize the results in a relationship diagram providing a broader context to analyzed data.

(a) https://orcid.org/0000-0002-0212-6593
(b) https://orcid.org/0000-0002-3746-5114

2 RELATED WORK

Commonly used techniques, analysis methods, and research directions of network forensics are summarized in the survey by Khan et al. (Khan et al., 2016). In addition to a taxonomy proposal, they also summarize the open challenges and discuss possible solutions. A well-arranged insight into the area is also provided by Ric Messier's book (Messier, 2017) presenting the whole process of network forensics together with commonly used tools. The main emphasis is on the practical use of these tools in a real-world environment, allowing us to better understand the analyst's needs. Besides analysis approaches used in the network forensics area, our research is also motivated by criminal investigation processes. To solve the crime and maintain an overview of the whole case, criminal investigators typically capture associations between real-world objects and events through link analysis (Atkin, 2011). Thanks to this approach, they can maintain a good overview of the data while preserving all the analysis details, which is also the goal of network forensics.

The utilization of graph databases for network traffic analysis was introduced by Neise (Neise, 2016). He proposed to use Zeek for data extraction and store the data in the Neo4j graph database (Neo4j, 2021). To capture the extracted information, Neise proposed a simple data model, which we further develop in our work. Besides, we propose to utilize Dgraph to efficiently store and analyze large amounts of data, which is difficult to achieve in the Neo4j database. The use of Neo4j is also proposed by Diederichsen et al. (Diederichsen et al., 2019). They, however, were focused only on the analysis of connection, DNS, and HTTP logs. They designed a data model that takes into account all attributes in the form of associations. This approach generates many nodes and edges, which places huge demands on storage and computing capacity. Another example of a graph-based network traffic analysis is Sec2graph proposed by Leichtnam et al. (Leichtnam et al., 2020). They have further developed the approach of Neise and proposed automatic detection of attacks and anomalies. They did not store the data in a database for exploratory analysis but transformed them into associations, which they analyzed using machine learning.

3 TOOLKIT DESIGN

The central part of the GRANEF toolkit is graph database Dgraph which enables scalable data storage, and processing of large-size network traffic captures. The toolkit further consists of tools for data preprocessing as well as their exploratory analysis. These data processing and analysis tools are implemented as standalone modules as Docker containers where one module can implement more than one tool, or one tool can be implemented by more than one module, as shown in Figure 1. For example, the indexing and graph database tools, both working directly with a running instance of Dgraph, use the functionality of one Data handling module. The Transformation module is our custom solution, and the remaining modules are based on the use of already existing tools.

[Diagram labels: PCAP, Extraction module, Transformation module, Data handling module (indexing, graph database), API module, Web module]
Figure 1: Data pipeline of the GRANEF toolkit.

Separation of data processing into standalone modules allows us to easily replace or update some modules without changing the remaining, as long as the compatibility with subsequent modules is preserved. Besides, this approach allows us to store intermediate results and use them in other analysis tools or speed up the data processing for a new analysis.

3.1 Data Extraction

Network traffic captures are initially processed by Zeek, which extracts information from packet headers and application layers (e.g., from HTTP, DNS, TLS, and SSH protocols) and produces them as log files. By default, it aggregates packets to connections and stores their characteristics. Individual records across log files are linked through a unique connection identifier that easily links extracted data as associations. The advantage of Zeek is the variety of data processing settings and especially the possibility of extending it with new extraction methods. This functionality makes it possible to respond to various requirements of network traffic forensics and reflect new trends and applications. One possible extension to the Extraction module would be to add the export of transferred application data or files. Zeek manages to save captured files in a separate folder, whereas the reference to these files is retained in the corresponding log. It is also possible to extend packet analysis scripts and extract additional information about the connection not available in a default configuration. The modularity feature of the GRANEF toolkit plays an important role in this case as it allows us to prepare several containers with various configurations and data processing extensions allowing us to reflect different requirements to the current case of network forensics.

[Diagram labels: node types Host, Connection, Application, Host-data; edges communicated, originated, responded, produced, <host-data>, <host-data/uid>]
Figure 2: Simplified database schema showing nodes and their associations.
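
The linking of records through the connection identifier described in Section 3.1 can be illustrated with a small sketch. It is not the GRANEF extraction code: the records below are hand-written dictionaries shaped like entries of Zeek's conn.log and http.log (the field names uid, id.orig_h, id.resp_h, proto, method, host, and uri follow Zeek's default logs, while all values and the helper link_by_uid are assumptions made for illustration).

```python
from collections import defaultdict

# Illustrative records shaped like Zeek conn.log and http.log entries.
# Field names follow Zeek's default logs; the values are made up.
conn_log = [
    {"uid": "C1a2b3", "id.orig_h": "10.10.0.5", "id.resp_h": "93.184.216.34",
     "proto": "tcp", "duration": 1.2},
    {"uid": "C4d5e6", "id.orig_h": "10.10.0.7", "id.resp_h": "10.10.0.1",
     "proto": "udp", "duration": 0.01},
]
http_log = [
    {"uid": "C1a2b3", "method": "GET", "host": "example.com", "uri": "/"},
    {"uid": "C1a2b3", "method": "GET", "host": "example.com", "uri": "/logo.png"},
]


def link_by_uid(connections, app_records):
    """Attach application-layer records to the connection sharing their uid.

    Hypothetical helper: it only demonstrates that the shared uid is enough
    to express records from different logs as associations.
    """
    by_uid = defaultdict(list)
    for record in app_records:
        by_uid[record["uid"]].append(record)
    return [
        {**conn, "application": by_uid.get(conn["uid"], [])}
        for conn in connections
    ]


linked = link_by_uid(conn_log, http_log)
for conn in linked:
    print(conn["uid"], conn["id.orig_h"], "->", conn["id.resp_h"],
          len(conn["application"]), "HTTP record(s)")
```

Grouping application records by uid first keeps the join linear in the number of records, which matters for captures with hundreds of thousands of connections. In a graph representation, each attached record would become a separate node connected to its connection node rather than a nested list.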

3.2 Data Transformation

The Transformation module takes log files produced by Zeek in the previous module and converts them to the RDF triples format (W3C, 2014) accepted by Dgraph. This conversion of log data is performed by a custom script that processes selected log files record by record. Since each log file has a predefined set of attributes, we can manually decide which ones to transfer to the database and how to treat them. This approach makes it very easy to incorporate any changes in the design of the database schema or any information obtained from external sources. Such information can be, for example, an attribute value that indicates that the host with a given IP address has some property that was discovered during forensic analysis. This information can also be added later through a unique external identifier given to the node at the stage of its definition.

The conversion is done according to a scheme whose simplified form is shown in Figure 2. This scheme is based on Neise (Neise, 2016) and Leichtnam et al. (Leichtnam et al., 2020), who represent individual logs as separate nodes and connect them with defined associations. The information contained in log records is stored in the database as node attributes, which allows filtering or aggregation on them. Compared to previously proposed schemas, we add an additional edge communicated between individual hosts to facilitate the definition of queries focused only on the connection's existence and optimize the query execution. Communicating hosts are extracted from the connection log and represented as separate nodes. We also simplify edge naming to be uniform throughout all logs and make it easier to query the entire schema. The resulting schema is designed to reflect people's common perception of how a computer network works and simplifies analysis, as queries can be formed at the highest level of abstraction.

Each node of the schema has an assigned type. Host nodes represent a device on the network with a given IP address. These nodes can be associated with Host-data nodes containing information extracted from application data related to the host. Examples of such data are domain names extracted from DNS, HTTP, or TLS traffic. Further, it can refer to transferred files, certificates, or user-agents. It is also possible to associate external information relevant to the host, such as details from reputation databases. The Connection nodes contain information about the network connection, such as its duration, the number of bytes transferred, relevant ports, and used protocol. The Application nodes contain application data extracted from the Connection and may be mutually connected by an additional edge. Edge host-data/uid is present to preserve which Application node created the associated Host-data node. All edges are directional but allow reverse processing for querying from an arbitrary node regardless of its type.

Thanks to the universal definition of the proposed scheme, it is possible to transform other types of data related to network traffic analysis in a similar way to the Zeek logs. An example is IP flows, which may currently contain information about individual connections and can be extended by information extracted from application data (Velan, 2018). Alternatively, it is possible to transform system logs related to network connections or collected from network devices. These transformations can be represented as separate modules of the toolkit to be easily interconnected according to the network forensics case.

3.3 Data Handling

The core part of data handling is the Dgraph cluster consisting of two types of computational nodes. Dgraph Zero controls the cluster and serves as the main component responsible for the orchestration of the database and analysis. Data processing is performed by Dgraph Alpha nodes containing indexed data. At least one Zero node and one Alpha node are needed to handle stored data. Additional details about the database and data analysis abilities can be found in its documentation (Dgraph Labs, Inc., 2021).

The Data handling module consists of indexing and graph database components, working directly with an instance of Dgraph. The indexing component uploads and indexes RDF triples and stores them in an internal database structure. The main part of
the component is Dgraph Bulk Loader which operates on the MapReduce concept. It appropriately utilizes available computational resources. In addition, the component allows us to specify the number of Alpha nodes that will be utilized in the following graph database component. Large volumes of data can thus be distributed within the cluster while maintaining the ability to perform fast analysis over stored data. Results of the indexing component are binary files storing both the data and indexes. The advantage of this approach is a reduction of data processing time when the data are reloaded. Besides, it is possible to use the generated index within another instance of Dgraph deployed on a more powerful computation node.

The graph database component takes care of managing Dgraph nodes and their communication. Data provided by the indexing component are loaded to Alpha nodes. The exposed Dgraph user interface allows, among other things, basic queries over the data. However, it is not suitable for exploratory analysis as it has only a limited degree of interaction. The analyst must also know the specifics of the query language, which complicates the adoption of the proposed network forensics approach.

3.4 Data Analysis

Data stored in Dgraph are queried using Dgraph Query Language (DQL) based on GraphQL. An example of such a query is provided in Figure 3 containing a selection of TCP connections and transferred files from a local network. A DQL query finds nodes based on search criteria matching patterns in the graph and returns a graph in JSON format (Dgraph Labs, Inc., 2021). Queries are composed of nested blocks; their evaluation starts by finding the initial set of nodes specified in the query root, against which the graph matching is applied. In addition to filtering, DQL allows variable definition and data aggregation. Thanks to the pre-defined schema, results are predictable. A disadvantage is that DQL is not widespread yet, and the analyst must devote some time to learning how to perform advanced queries. To overcome this issue, we have created an additional analysis module providing an abstract layer over DQL.

The GRANEF analysis tool consists of two modules: the Application interface (API) module and the Web user interface module. This approach supports greater versatility of the entire solution, as it is possible to connect other systems to the API without the need to use a web user interface. The API implements querying and processing of data stored in Dgraph, while only filter properties or immersion rates are required as input. The provided API functions reflect common tasks of exploratory analysis and are based on both our experience and the steps typically performed by analysts within our CSIRT team.

    { getConn(func: allof(host.ip, cidr, "10.10.0.0/16")) {
        name : host.ip
        host.originated @filter(eq(connection.proto, "tcp")) {
          expand(Connection)
          connection.produced {
            expand(_all_)
            files.fuid { expand(File) }
          }
          ~host.responded { responded_ip : host.ip }
        }
    }}

Figure 3: Selection of local network TCP connections and transferred files using DQL.

The web user interface utilizes the API and represents its user-friendly extension that allows performing defined queries and supports exploratory analysis. The query results are displayed in an interactive relationship visualization which uses a force-directed graph layout and allows node aggregation to show large relationship diagrams while preserving a simple overview of the data. Based on our experience, this layout seems to be the most comprehensible. However, we plan to verify other variants in the future. An example of such a visualization is shown in Figure 4, containing one specific connection from the response to the query from Figure 3. This approach supports interactivity as the analyst can select nodes or edges, see all attributes, and perform another analytical query over them while the result is added to the same visualization or displayed in a new analysis tab. As part of the exploratory analysis, it is possible to browse through the associations between information extracted from network traffic and observe a context that would otherwise remain hidden.

[Diagram labels: Host, Host, Files]
Figure 4: Visualization of one connection between hosts.

4 DISCUSSION

To evaluate the toolkit capabilities, we use network traffic datasets containing realistic scenarios with small-size captures and larger ones with size in the
order of gigabytes. Especially, analysis of large network traffic captures is a typical use-case of network forensics, so we pay more attention to it. In this case, however, the analyst expects that preprocessing of such data places considerable computational demands, increasing processing time. Therefore, greater emphasis is on the subsequent analysis, which must be sufficiently interactive without delays.

4.1 Computational Requirements

To test data processing speed, we have prepared a virtual machine with Debian OS, 4 VCPU, and 16 GB RAM, which corresponds to today's ordinary hardware performance. The data processing speed of a small capture file (Digital Corpora, 2020) with the size of several megabytes was affected more by container startup. Nevertheless, the processing took an average of tens of seconds. To test the processing of a larger network capture, we selected a capture from the second day of the CyberCzech exercise (Tovarnak et al., 2020) which is approximately 6 GB in size and contains 330,564 connections. The average processing time for this file was approximately 7 minutes, with extraction taking approximately 120 seconds, transformation 50 seconds, and indexing 250 seconds. The transformed dataset resulted in 718,475 nodes and 397,632 edges, with an index size of approximately 820 MB. Although this data processing time is not critical for network forensics, it is possible to achieve further improvements by parallelizing the extraction using multiple Zeek runs or using a bigger cluster for the data indexing task.

Once the data are indexed, analytical queries are performed fast, with the results typically returned in one or two seconds. However, the main challenge is to render the results in the form of relationship visualization. It is necessary to spread nodes in a suitable layout to reasonably support the visual analysis. Besides, a larger number of nodes places great computational demands and causes the resulting graph to become less clear. For this reason, it is necessary to allow the grouping of similar nodes so that the overall visualization could offer a sufficient response. We perceive this visualization requirement as a crucial factor of the toolkit, which we plan to focus on more in future work.

4.2 Exploratory Analysis

The main benefit of graph-based network forensics is the support of exploratory analysis. The general queries that are part of the API follow the analyst's typical behavior. In the beginning, it is essential to restrict the set of nodes we want to focus on. To do so, we need to understand the nature of as many hosts and connections as possible to distinguish unusual network traffic. Examples of some queries are "return all connections and protocol types between two specific hosts" or "return number of all specified connections for hosts that fall within given CIDR range". We have also taken advantage of DQL and defined queries utilizing aggregation functions, allowing us, for example, to group all host connections according to the number of transferred bytes.

The result of a query that focused on a subset of outgoing TCP connections of one host can be seen in Figure 5. An advantage of such visualization is that it often allows the analyst to distinguish regular network traffic from suspicious just at first glance based solely on the resulting pattern. In the provided example, it would be relevant to pay attention to the communication with the left node. In the subsequent analysis step, the analyst can select nodes or a group of nodes, further explore their associations, and go into the graph's depth and explore observed connections.

Figure 5: TCP connections in the National Gallery DC Scenario dataset (Digital Corpora, 2020).

Besides the mentioned advantages, our experience has also shown the challenges that need to be faced with the proposed graph-based network forensics approach. Fast relationship visualization is crucial as it directly affects the exploratory analysis. Another challenge we have encountered is taking time perception into account. Associations of individual connections are created independently of the time context. This approach allows the analyst to overview events that have occurred over a longer time. On the other hand, it is necessary to consider the continuity of individual network connections in certain cases. This can
be achieved through appropriate attribute filtering, but a challenge is how to make both of these methods accessible to the analyst. Another challenge associated with graph analysis is the need for a mindset change as analysts are used to other approaches. However, our experience shows that they can naturally analyze the data provided in this way after a while. This observation requires a more detailed verification, which we plan to perform in future work.

5 CONCLUSION

Graph-based network forensics is a new approach to analyzing network traffic data utilizing modern database technologies capable of storing large amounts of information based on their associations. It follows the typical way of human thinking and perception of the characteristics of the surrounding world. Its main advantage is the connection of exploratory analysis of network traffic data with results visualization allowing analysts to easily go through the acquired knowledge and visually identify interesting network traffic. Our experience also shows that this approach is not only a new method of data storage and querying but also a shift of mindset that allows us to perceive network data in a new way.

In this paper, we introduced the GRANEF toolkit utilizing the Dgraph database that stores transformed information from network traffic captures extracted by the Zeek network security monitor. The stored data are presented to the user via a web-based user interface that provides an abstraction layer above the database query language and allows the user to efficiently query data, visualize results in the form of a relationship diagram, and perform exploratory analysis.

Our aim in describing the toolkit was to introduce a new approach to network forensics and incident investigation and describe this solution's specifics. As part of future work, we want to further compare this approach with other typically used analytical methods, both in terms of functionality and analyst's behavior. Furthermore, we plan to focus on the definition of new methods for automatic analysis of network traffic based on the associations provided by our proposed data model. We also see great potential in connecting various data types and sources, which could create a unified analytical environment allowing us to analyze the data obtained from hosts and network traffic in one place. The first evaluation results of the proposed approach demonstrate its great potential for network forensics and generally for exploratory analysis of network traffic data.

ACKNOWLEDGEMENTS

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 833418.

REFERENCES

Atkin, H. (2011). Criminal Intelligence: Manual for Analysts. UNODC Criminal Intelligence Manual for Analysts. United Nations Office on Drugs and Crime (UNODC).

Dgraph Labs, Inc. (2021). Native GraphQL Database: The Best Graph DB | Dgraph. https://dgraph.io/. Accessed: 2021-01-21.

Diederichsen, L., Choo, K.-K. R., and Le-Khac, N.-A. (2019). A Graph Database-Based Approach to Analyze Network Log Files. In Network and System Security, pages 53-73. Springer International Publishing.

Digital Corpora (2020). The 2012 National Gallery DC Scenario. https://digitalcorpora.org/corpora/scenarios/national-gallery-dc-2012-attack. Accessed: 2021-01-21.

Fernandes, G., Rodrigues, J. J. P. C., Carvalho, L. F., Al-Muhtadi, J. F., and Proenca, M. L. (2018). A comprehensive survey on network anomaly detection. Telecommunication Systems.

Khan, S., Gani, A., Wahab, A. W. A., Shiraz, M., and Ahmad, I. (2016). Network forensics: Review, taxonomy, and open challenges. Journal of Network and Computer Applications, 66:214-235.

Leichtnam, L., Totel, E., Prigent, N., and Mé, L. (2020). Sec2graph: Network Attack Detection Based on Novelty Detection on Graph Structured Data. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 238-258. Springer International Publishing.

Messier, R. (2017). Network Forensics. John Wiley & Sons, Ltd.

Neise, P. (2016). Intrusion Detection Through Relationship Analysis. Technical report, SANS Institute.

Neo4j (2021). Neo4j Graph Platform - The Leader in Graph Databases. https://neo4j.com. Accessed: 2021-01-30.

The Zeek Project (2020). The Zeek Network Security Monitor. https://zeek.org/. Accessed: 2021-01-21.

Tovarňák, D., Špaček, S., and Vykopal, J. (2020). Traffic and log data captured during a cyber defense exercise. Data in Brief, 31.

Velan, P. (2018). Application-Aware Flow Monitoring. Doctoral theses, dissertations, Masaryk University, Faculty of Informatics, Brno.

W3C (2014). RDF 1.1 N-Triples. https://www.w3.org/TR/n-triples/. Accessed: 2021-01-21.

Zhang, H., Zeng, H., Priimagi, A., and Ikkala, O. (2020). Viewpoint: Pavlovian Materials—Functional Biomimetics Inspired by Classical Conditioning. Advanced Materials, 32(20).
