2021 SECRYPT Granef Utilization of A Graph Database For Network Forensics Paper - Archive
Abstract: Understanding the information in captured network traffic, extracting the necessary data, and performing incident investigations are principal tasks of network forensics. The analysis of such data is typically performed by tools allowing manual browsing, filtering, and aggregation or tools based on statistical analyses and visualizations facilitating data comprehension. However, the human brain is used to perceiving the data in associations, which these tools can provide only in a limited form. We introduce a GRANEF toolkit that demonstrates a new approach to exploratory network data analysis based on associations stored in a graph database. In this article, we describe data transformation principles, utilization of a scalable graph database, and data analysis techniques. We then discuss and evaluate our proposed approach using a realistic dataset. Although we are at the beginning of our research, the current results show the great potential of association-based analysis.
[...] the network forensics area, our research is also motivated by criminal investigation processes. To solve the crime and maintain an overview of the whole case, criminal investigators typically capture associations between real-world objects and events through link analysis (Atkin, 2011). Thanks to this approach, they can maintain a good overview of the data while preserving all the analysis details, which is also the goal of network forensics.

[Figure: diagram with the labels PCAP, transformation, indexing, graph database, Data handling module, API module, Web module]
[...]erated index within another instance of Dgraph deployed on a more powerful computation node.

The graph database component takes care of managing Dgraph nodes and their communication. Data provided by the indexing component are loaded to Alpha nodes. The exposed Dgraph user interface allows, among other things, performing basic queries over the data. However, it is not suitable for exploratory analysis as it offers only a limited degree of interaction. The analyst must also know the specifics of the query language, which complicates the adoption of the proposed network forensics approach.

3.4 Data Analysis

Data stored in Dgraph are queried using Dgraph Query Language (DQL) based on GraphQL. An example of such a query is provided in Figure 3, containing a selection of TCP connections and transferred files from a local network. A DQL query finds nodes based on search criteria matching patterns in the graph and returns a graph in JSON format (Dgraph Labs, Inc., 2021). Queries are composed of nested blocks; their evaluation starts by finding the initial set of nodes specified in the query root, against which the graph matching is applied. In addition to filtering, DQL allows variable definition and data aggregation. Thanks to the pre-defined schema, results are predictable. A disadvantage is that DQL is not widespread yet, and the analyst must invest some time before being able to perform advanced queries. To overcome this issue, we have created an additional analysis module providing an abstract layer over DQL.

The GRANEF analysis tool consists of two modules: the Application interface (API) module and the Web user interface module. This approach supports greater versatility of the entire solution, as it is possible to connect other systems to the API without the need to use a web user interface. The API implements querying and processing of data stored in Dgraph, while only filter properties or immersion rates are required as input. The provided API functions reflect [...]

Figure 3: Selection of local network TCP connections and transferred files using DQL.

The web user interface utilizes the API and represents its user-friendly extension that allows performing defined queries and supports exploratory analysis. The query results are displayed in an interactive relationship visualization which uses a force-directed graph layout and allows node aggregation to show large relationship diagrams while preserving a simple overview of the data. Based on our experience, this layout seems to be the most comprehensible. However, we plan to verify other variants in the future. An example of such a visualization is shown in Figure 4, containing one specific connection from the response to the query from Figure 3. This approach supports interactivity as the analyst can select nodes or edges, see all attributes, and perform another analytical query over them while the result is added to the same visualization or displayed in a new analysis tab. As part of the exploratory analysis, it is possible to browse through the associations between information extracted from network traffic and observe a context that would otherwise remain hidden.

Figure 4: Visualization of one connection between hosts.

4 DISCUSSION

To evaluate the toolkit capabilities, we use network traffic datasets containing realistic scenarios with small-size captures and larger ones with size in the order of gigabytes. In particular, the analysis of large network traffic captures is a typical use case of network forensics, so we pay more attention to it. In this case, however, the analyst expects that preprocessing such data places considerable computational demands and increases processing time. Therefore, greater emphasis is on the subsequent analysis, which must be sufficiently interactive without delays.

4.1 Computational Requirements

To test data processing speed, we have prepared a virtual machine with Debian OS, 4 vCPUs, and 16 GB RAM, which corresponds to today's ordinary hardware performance. The data processing speed of a small capture file (Digital Corpora, 2020) with the size of several megabytes was affected more by container startup. Nevertheless, the processing took an average of tens of seconds. To test the processing of a larger network capture, we selected a capture from the second day of the CyberCzech exercise (Tovarňák et al., 2020), which is approximately 6 GB in size and contains 330,564 connections. The average processing time for this file was approximately 7 minutes, with extraction taking approximately 120 seconds, transformation 50 seconds, and indexing 250 seconds. The transformed dataset resulted in 718,475 nodes and 397,632 edges, with an index size of approximately 820 MB. Although this data processing time is not critical for network forensics, it is possible to achieve further improvements by parallelizing the extraction using multiple Zeek runs or using a bigger cluster for the data indexing task.

Once the data are indexed, analytical queries are performed fast, with results typically returned in one or two seconds. However, the main challenge is to render the results in the form of relationship visualization. It is necessary to spread nodes in a suitable layout to reasonably support the visual analysis. Besides, a larger number of nodes places great computational demands and causes the resulting graph to become less clear. For this reason, it is necessary to allow the grouping of similar nodes so that the overall visualization could offer a sufficient response. We perceive this visualization requirement as a crucial factor of the toolkit, which we plan to focus on more in future work.

4.2 Exploratory Analysis

The main benefit of graph-based network forensics is the support of exploratory analysis. The general queries that are part of the API follow the analyst's typical behavior. In the beginning, it is essential to restrict the set of nodes we want to focus on. To do so, we need to understand the nature of as many hosts and connections as possible to distinguish unusual network traffic. Examples of some queries are "return all connections and protocol types between two specific hosts" or "return number of all specified connections for hosts that fall within given CIDR range". We have also taken advantage of DQL and defined queries utilizing aggregation functions, allowing us, for example, to group all host connections according to the number of transferred bytes.

The result of a query that focused on a subset of outgoing TCP connections of one host can be seen in Figure 5. An advantage of such visualization is that it often allows the analyst to distinguish regular network traffic from suspicious traffic at first glance, based solely on the resulting pattern. In the provided example, it would be relevant to pay attention to the communication with the left node. In the subsequent analysis step, the analyst can select nodes or a group of nodes, further explore their associations, and go into the graph's depth and explore observed connections.

Figure 5: TCP connections in the National Gallery DC Scenario dataset (Digital Corpora, 2020).

Besides the mentioned advantages, our experience has also shown the challenges that need to be faced with the proposed graph-based network forensics approach. Fast relationship visualization is crucial as it directly affects the exploratory analysis. Another challenge we have encountered is taking time perception into account. Associations of individual connections are created independently of the time context. This approach allows the analyst to overview events that have occurred over a longer time. On the other hand, it is necessary to consider the continuity of individual network connections in certain cases. This can be achieved through appropriate attribute filtering, but a challenge is how to make both of these methods accessible to the analyst. Another challenge associated with graph analysis is the need for a mindset change as analysts are used to other approaches. However, our experience shows that they can naturally analyze the data provided in this way after a while. This observation requires a more detailed verification, which we plan to perform in future work.

5 CONCLUSION

Graph-based network forensics is a new approach to analyzing network traffic data utilizing modern database technologies capable of storing large amounts of information based on their associations. It follows the typical way of human thinking and perception of the characteristics of the surrounding world. Its main advantage is the connection of exploratory analysis of network traffic data with results visualization, allowing analysts to easily go through the acquired knowledge and visually identify interesting network traffic. Our experience also shows that this approach is not only a new method of data storage and querying, but also a shift of mindset that allows us to perceive network data in a new way.

In this paper, we introduced the GRANEF toolkit utilizing the Dgraph database that stores transformed information from network traffic captures extracted by the Zeek network security monitor. The stored data are presented to the user via a web-based user interface that provides an abstraction layer above the database query language and allows the user to efficiently query data, visualize results in the form of a relationship diagram, and perform exploratory analysis.

Our aim of the provided toolkit description was to introduce a new approach to network forensics and incident investigation and describe this solution's specifics. As part of future work, we want to further compare this approach with other typically used analytical methods, both in terms of functionality and analyst's behavior. Furthermore, we plan to focus on the definition of new methods for automatic analysis of network traffic based on the associations provided by our proposed data model. We also see great potential in connecting various data types and sources, which could create a unified analytical environment allowing us to analyze the data obtained from hosts and network traffic in one place. The first evaluation results of the proposed approach demonstrate its great potential for network forensics and generally for exploratory analysis of network traffic data.

ACKNOWLEDGEMENTS

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 833418.

REFERENCES

Atkin, H. (2011). Criminal Intelligence: Manual for Analysts. UNODC Criminal Intelligence Manual for Analysts. United Nations Office on Drugs and Crime (UNODC).
Dgraph Labs, Inc. (2021). Native GraphQL Database: The Best Graph DB | Dgraph. https://dgraph.io/. Accessed: 2021-01-21.
Diederichsen, L., Choo, K.-K. R., and Le-Khac, N.-A. (2019). A Graph Database-Based Approach to Analyze Network Log Files. In Network and System Security, pages 53-73. Springer International Publishing.
Digital Corpora (2020). The 2012 National Gallery DC Scenario. https://digitalcorpora.org/corpora/scenarios/national-gallery-dc-2012-attack. Accessed: 2021-01-21.
Fernandes, G., Rodrigues, J. J. P. C., Carvalho, L. F., Al-Muhtadi, J. F., and Proenca, M. L. (2018). A comprehensive survey on network anomaly detection. Telecommunication Systems.
Khan, S., Gani, A., Wahab, A. W. A., Shiraz, M., and Ahmad, I. (2016). Network forensics: Review, taxonomy, and open challenges. Journal of Network and Computer Applications, 66:214-235.
Leichtnam, L., Totel, E., Prigent, N., and Mé, L. (2020). Sec2graph: Network Attack Detection Based on Novelty Detection on Graph Structured Data. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 238-258. Springer International Publishing.
Messier, R. (2017). Network Forensics. John Wiley & Sons, Ltd.
Neise, P. (2016). Intrusion Detection Through Relationship Analysis. Technical report, SANS Institute.
Neo4j (2021). Neo4j Graph Platform - The Leader in Graph Databases. https://neo4j.com. Accessed: 2021-01-30.
The Zeek Project (2020). The Zeek Network Security Monitor. https://zeek.org/. Accessed: 2021-01-21.
Tovarňák, D., Špaček, S., and Vykopal, J. (2020). Traffic and log data captured during a cyber defense exercise. Data in Brief, 31.
Velan, P. (2018). Application-Aware Flow Monitoring. Doctoral theses, dissertations, Masaryk University, Faculty of Informatics, Brno.
W3C (2014). RDF 1.1 N-Triples. https://www.w3.org/TR/n-triples/. Accessed: 2021-01-21.
Zhang, H., Zeng, H., Priimagi, A., and Ikkala, O. (2020). Viewpoint: Pavlovian Materials—Functional Biomimetics Inspired by Classical Conditioning. Advanced Materials, 32(20).
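To make the nested-block structure of a DQL query (Section 3.4) more concrete, the following minimal Python sketch assembles a query text in which a root function selects the initial host node and nested blocks follow its associations to TCP connections and transferred files. The helper build_host_tcp_query and all predicate names (host.ip, host.originated, connection.proto, connection.ts, connection.orig_bytes, connection.produced, file.name) are hypothetical placeholders chosen for illustration; they are not taken from the GRANEF schema, and the query shown in Figure 3 may differ.

```python
import ipaddress


def build_host_tcp_query(host_ip: str) -> str:
    """Build a DQL query text for TCP connections originated by one host.

    All predicate names used below are hypothetical placeholders and do
    not claim to match the actual GRANEF schema.
    """
    # Accept only a well-formed IP address before embedding it into the
    # query text; ipaddress.ip_address raises ValueError otherwise.
    ip = str(ipaddress.ip_address(host_ip))
    return f"""{{
  host_connections(func: eq(host.ip, "{ip}")) {{
    host.ip
    host.originated @filter(eq(connection.proto, "tcp")) {{
      connection.ts
      connection.orig_bytes
      connection.produced {{
        file.name
      }}
    }}
  }}
}}"""


if __name__ == "__main__":
    print(build_host_tcp_query("192.168.1.10"))
```

A wrapper of this kind is one possible way for an abstraction layer to accept only filter properties as input while hiding the query text from the analyst.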
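The two kinds of example queries quoted above (counting connections for hosts within a CIDR range and grouping a host's connections by transferred bytes) can be illustrated outside the database with a small Python sketch over in-memory records. The record layout (orig, resp, proto, bytes), the sample values, and both function names are invented for illustration only; in GRANEF such questions are answered by DQL queries over the graph, not by this code.

```python
import ipaddress
from collections import Counter, defaultdict

# Hypothetical connection records: originator, responder, protocol, bytes.
CONNECTIONS = [
    {"orig": "10.0.0.5", "resp": "10.0.1.7", "proto": "tcp", "bytes": 1200},
    {"orig": "10.0.0.5", "resp": "8.8.8.8", "proto": "udp", "bytes": 90},
    {"orig": "10.0.0.9", "resp": "10.0.1.7", "proto": "tcp", "bytes": 300},
    {"orig": "192.168.3.2", "resp": "10.0.0.5", "proto": "tcp", "bytes": 50},
]


def connections_per_host_in_cidr(connections, cidr, proto):
    """Count connections of one protocol per originating host inside cidr."""
    network = ipaddress.ip_network(cidr)
    counts = Counter()
    for conn in connections:
        in_range = ipaddress.ip_address(conn["orig"]) in network
        if in_range and conn["proto"] == proto:
            counts[conn["orig"]] += 1
    return dict(counts)


def bytes_per_peer(connections, host):
    """Sum the bytes of one host's outgoing connections per responder."""
    totals = defaultdict(int)
    for conn in connections:
        if conn["orig"] == host:
            totals[conn["resp"]] += conn["bytes"]
    # Sort so that the largest transfers come first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(connections_per_host_in_cidr(CONNECTIONS, "10.0.0.0/24", "tcp"))
    print(bytes_per_peer(CONNECTIONS, "10.0.0.5"))
```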
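As a quick consistency check of the processing times reported above, the short Python sketch below adds up the three stage durations and derives rough throughput values. Reading "6 GB" as 6,000 MB is an assumption, and the derived rates are only approximate because the reported values are themselves approximate.

```python
# Stage durations in seconds, as reported for the larger capture.
STAGES = {"extraction": 120, "transformation": 50, "indexing": 250}
CAPTURE_MB = 6 * 1000   # assumption: "6 GB" read as 6,000 MB
CONNECTIONS = 330_564   # number of connections in the capture


def summarize(stages, capture_mb, connections):
    """Return total time, rough throughput, and each stage's share."""
    total_s = sum(stages.values())
    return {
        "total_seconds": total_s,
        "total_minutes": total_s / 60,
        "mb_per_second": capture_mb / total_s,
        "connections_per_second": connections / total_s,
        "share": {name: secs / total_s for name, secs in stages.items()},
    }


if __name__ == "__main__":
    for key, value in summarize(STAGES, CAPTURE_MB, CONNECTIONS).items():
        print(key, value)
```

The three stages sum to 420 seconds, which matches the reported 7 minutes, and indexing accounts for roughly 60 % of the total, in line with the suggestion to use a bigger cluster for the indexing task.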