Log Data Normalization for Event Streams
Abstract—Monitoring plays a crucial role in the operation of any sizeable distributed IT infrastructure. Whether it is a university network or a cloud datacenter, monitoring information is continuously used in a wide spectrum of ways, ranging from mission-critical jobs, e.g. accounting or incident handling, to equally important development-related tasks, e.g. debugging or fault detection. Whilst pursuing a novel vision of new-generation event-driven monitoring systems, we have identified that a particularly rich source of monitoring information, computer logs, is also one of the most problematic in terms of automated processing. Log data are predominantly generated in an ad-hoc manner using a variety of incompatible formats, with the most important pieces of information, i.e. log messages, in the form of unstructured strings. This clashes with our long-term goal of designing a system enabling its users to transparently define real-time continuous queries over homogeneous streams of properly defined monitoring event objects with explicitly described structure. Our goal is to bridge this gap by normalizing the poorly structured log data into streams of structured event objects. The combined challenge of this goal is structuring the log data whilst considering the high velocity with which they are generated in modern IT infrastructures. This paper summarizes the contributions of the dissertation thesis "Normalization of Unstructured Log Data into Streams of Structured Event Objects", which deals with the matter at hand in detail.

Index Terms—log management, logging, data integration, normalization, stream processing, monitoring

I. INTRODUCTION

Computer logs are one of the few mechanisms available for gaining visibility into the behavior of an IT infrastructure and its elements. They are also considered to be one of the richest and most valuable sources of such behavior-related monitoring information. However, log data are repeatedly reported to be of poor quality, mainly because a considerable portion of logs is unstructured by nature. This renders them unsuitable for straightforward automated processing and analysis. In many cases, even semi-structured log data can be considered suboptimal for direct processing, i.e. when being processed by systems that expect some kind of schema to be imposed on the processed data. During the operation of any modern IT infrastructure, vast floods of heterogeneous log data are generated by many distributed producers spread across the infrastructure's layers and tiers.

These facts directly clash with a vision of a new-generation event-driven monitoring system enabling its users to transparently define real-time continuous queries over homogeneous streams of properly defined monitoring events. The continuous queries would be used to detect complex events, for example, one thousand unsuccessful logins of user root in 5 minutes, representing patterns of simpler events present in the monitoring information, e.g. user login in this case. Bluntly put, this amounts to a holistic application of the Complex Event Processing approach to the monitoring and log analysis domain.

II. CONTEXT AND PROBLEM STATEMENT

In the context of our work, an ideal state of affairs would be if all the log data generated by a given IT infrastructure were accessible in an interoperable and scalable manner as streams of structured event objects. A structured event object is a serialized piece of data, representing an occurrence, which is described via an explicit and strict data schema. An event stream is an infinite sequence of such objects adhering to the same schema. This would allow all of the log data to be accessed in a transparent and unified way within the notion of a loosely coupled event-driven architecture. As a result, the log data consumers would not only be able to directly utilize the Complex Event Processing approach for advanced correlation and monitoring queries, but it would also be possible to research novel monitoring approaches, e.g. based on machine learning, pattern mining, or predictive modelling. All of this over high-quality source data.

In its current state, log data are continuously generated at high rates by many distributed producers using several transport protocols and many heterogeneous representations. Moreover, a predominant portion of log entries takes the form of unstructured or semi-structured data, with the main piece of information, i.e. the log message, represented as a free-form string mixing natural language with run-time context variables.

We propose to close this gap by means of data normalization, i.e. transformation and unification of data transport, data representation, data types, and data structures, resulting in a common format. Normalization is a recognized data integration pattern in the context of message-driven and, in turn, event-driven architectures. The presented dissertation thesis [1] deals with multiple knowledge gaps in areas inherent to the normalization of heterogeneous low-grade log data into streams of properly defined event objects.
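For illustration, the notion of a structured event object with an explicit and strict schema can be sketched as follows. This is a minimal Python sketch of ours, not part of the thesis; the event type and field names are hypothetical and merely mirror Listing 2 later in this paper.

    # Illustrative schema for a hypothetical "UserSession" event type.
    SCHEMA = {
        "timestamp": int,   # milliseconds since the Unix epoch
        "hostname":  str,
        "user":      str,
        "action":    str,   # e.g. "LOGIN"
    }

    def validate(event: dict) -> bool:
        """An event object is valid only if it carries exactly the
        declared fields, each of the declared type, i.e. an explicit
        and strict data schema."""
        return event.keys() == SCHEMA.keys() and all(
            isinstance(event[k], t) for k, t in SCHEMA.items()
        )

    # An event stream is then an (infinite) sequence of such objects,
    # all adhering to the same schema.
    event = {"timestamp": 1459877470000, "hostname": "serena",
             "user": "xtovarn", "action": "LOGIN"}
    assert validate(event)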
Table I
EXAMPLE OF MESSAGE TYPE DISCOVERY IN THE TASK OF LOG ABSTRACTION

IV. LOG ABSTRACTION – MESSAGE TYPE DISCOVERY

The discovery of message types for the purposes of log abstraction is a tiresome process when done entirely manually. Log data generated by a single application or software project can contain hundreds of unique message types, since the corresponding logging code can contain hundreds of unique logging statements. For this reason, the research in this respect is focused on automated approaches for message type discovery. In the literature, two orthogonal groups of approaches can be identified – the message type discovery can be based either on source code analysis, or on data mining techniques.
Since the source code of the targeted log data producer may not always be available for analysis, which is true especially for proprietary hardware and software appliances, we have turned our attention towards approaches that discover message types from historical log data via data mining techniques. A number of works emerged in this area over the years, utilizing different approaches primarily based on clustering and other data mining techniques.

We refer to our discovery algorithm as the Extended Nagappan-Vouk (ENG) algorithm, since it is based on the original idea of a frequency table and intra-message word frequency proposed in [2]. Beyond that, the algorithm is significantly improved in order to support multiple delimiters for tokenization, support multi-word variables, and report distinct message types with no overlapping; finally, the algorithm can be parameterized via a single parameter controlling its sensitivity and the granularity of the reported message types. The discovery algorithm is able to generate pattern sets in a special format directly suitable for log abstraction via pattern matching, as can be seen in Listing 1.

regexes: # regex tokens
  INT:  [integer, "[0-9]+"]
  BOOL: [boolean, "\btrue\b|\bfalse\b"]
  WORD: [string, "[0-9a-zA-Z]+"]

patterns: # patterns describing the message types
  grp0:
    mt1: 'User %{WORD:var1} logged %{WORD:var2}'
    mt2: 'Service %{WORD:var1} started'

Listing 1: Example of a generated pattern set combining regex tokens with patterns describing the message types
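To convey the underlying principle, the following Python sketch implements a bare-bones frequency-table discovery in the spirit of Nagappan and Vouk [2]. It is our simplified illustration, not the actual ENG algorithm (multi-delimiter tokenization, multi-word variables, and overlap elimination are omitted), and the sample log lines are made up.

    from collections import Counter

    def discover(messages, sensitivity=1.0):
        """Tokens that occur frequently at a given position are treated
        as constants of a message type; infrequent tokens are abstracted
        into variables.  A single sensitivity parameter controls the
        granularity, loosely mirroring the ENG parameterization."""
        freq = Counter()                        # (position, token) -> count
        tokenized = [m.split() for m in messages]
        for tokens in tokenized:
            freq.update(enumerate(tokens))

        types = set()
        for tokens in tokenized:
            counts = [freq[(i, t)] for i, t in enumerate(tokens)]
            threshold = max(counts) * sensitivity
            types.add(tuple(
                t if freq[(i, t)] >= threshold else "%{WORD}"
                for i, t in enumerate(tokens)
            ))
        return types

    logs = ["User alice logged in", "User bob logged out",
            "Service sshd started", "Service crond started"]
    for t in sorted(discover(logs)):
        print(" ".join(t))
    # -> Service %{WORD} started
    #    User %{WORD} logged %{WORD}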
For evaluation, we have reused the methodology and data sets of an existing evaluation study [4], which compared the accuracy of four different message type discovery algorithms (SLCT, IPLoM, LKE, and logSig). The data sets were randomly sampled to 2000 log messages each in order to shorten the running times of some of the more computationally intensive algorithms. The ground truth (gold standard) was created manually by the authors. The reported results (F-measures) of the evaluated algorithms, as well as the results of our algorithm, are summarized in Table II.

It can be seen that the proposed algorithm exhibits superior accuracy in an evaluation based on five real-world data sets with externally provided ground truth. When using its default settings (ENG), the algorithm achieved very high accuracy with an average F-measure of 0.953. When considering the best algorithm settings for each data set (ENG*), it exhibited an average F-measure of 0.996.

Table II
F-MEASURES FOR EVALUATED ALGORITHMS

V. LOG ABSTRACTION – PATTERN MATCHING

A discovered pattern set is only useful if incoming log entries can be matched against it at high speed: how can each log entry be efficiently matched against all of the matching patterns created for this purpose? Tree-based approaches address the problem of multi-pattern matching in a more straightforward way than naive sequential multi-regex matching, by organizing the matching patterns in various tree-like structures with the goal of segmenting and limiting the searched pattern-space. This tree-like organization can be either inter-pattern, i.e. organizing the individual patterns as a whole with respect to some observed knowledge, or intra-pattern, i.e. organizing the individual pattern components, words for example, into a tree-like matching structure.

We have designed an elegant multi-pattern matching algorithm based on a clever intra-pattern organization that is able to practically eliminate the need for multi-regex matching, whilst imposing only minimal limitations on the way the matching patterns can be created. The basic idea of our approach is based on organizing the pattern set into a special data structure we refer to as a regex trie (REtrie), as seen in Figure 1.
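As an illustration of the intra-pattern idea, the following minimal Python sketch organizes the words and regex tokens of the patterns from Listing 1 into a token trie, so that a log message is matched by a single walk over the trie instead of being tried against every pattern's regex. This is our simplified illustration of the general principle, not the actual REtrie algorithm, and the helper names are ours.

    import re

    TOKENS = {"WORD": re.compile(r"[0-9a-zA-Z]+"),
              "INT":  re.compile(r"[0-9]+")}

    class Node:
        def __init__(self):
            self.children = {}   # token (literal word or %{...}) -> Node
            self.mtype = None    # message type reported at this leaf

    def insert(root, pattern, mtype):
        node = root
        for tok in pattern.split():
            node = node.children.setdefault(tok, Node())
        node.mtype = mtype

    def match(node, words):
        if not words:
            return node.mtype
        head, rest = words[0], words[1:]
        if head in node.children:                    # literal words first
            found = match(node.children[head], rest)
            if found:
                return found
        for tok, child in node.children.items():     # then regex tokens
            name = tok[2:-1].split(":")[0] if tok.startswith("%{") else None
            if name and TOKENS[name].fullmatch(head):
                found = match(child, rest)
                if found:
                    return found
        return None

    root = Node()
    insert(root, "User %{WORD:var1} logged %{WORD:var2}", "mt1")
    insert(root, "Service %{WORD:var1} started", "mt2")
    print(match(root, "User xtovarn logged in".split()))   # -> mt1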
VI. END-TO-END LOG DATA NORMALIZATION

From the data integration perspective, normalization can be performed on four different translation layers. The transport layer determines the way the data are transferred over the network. The second layer determines the data representation, i.e. how the data are serialized into individual elements, consequently determining whether they are unstructured, semi-structured, or structured. The data types layer is extremely important since it defines the data types on which the domain model is based. The fourth, the data structures layer, describes a top-level domain model, i.e. what logical entities will be dealt with and what relationships they will have, if any. In terms of data integration, the most loosely coupled outcome of normalization takes the form of a Canonical Data Model, i.e. a common data format unifying the three top layers – the bottom layer is assumed to be based on messaging. In our case, the Canonical Data Model is represented by structured event objects and their individual types, whose data schemes are explicitly defined.

In the course of the normalization process, the parsing of different formats of log entries and the abstraction of log messages is only one of many different tasks that are actually needed to transform the log data into the desired state. Other common tasks that are somewhat inherent to the log data normalization process include: input and output adaptation, data serialization and deserialization, parsing, transformation, and enrichment.

Whilst some of the already existing log management tools are quite capable and support many of these tasks in one form or another, they have very limited capabilities in terms of structuring the log data into event objects, which mainly stems from their orientation towards basic semi-structured data manipulations and the predominantly untyped nature of their corresponding domain-specific languages. Although we have been able to implement end-to-end log data normalization logic by using these tools, it was always at the cost of manual type enforcement, ex-post schema definition, and combination with additional external functionality, which was rather error-prone.

A. Our Approach

To address the problems pointed out above, we have first designed an abstract log data normalization approach that allows for the data transformations to be carried out in a statically typed object-oriented paradigm, instead of being oriented towards dynamically typed or untyped transformations of associative arrays, as is common in practice. Then, we have created a domain-specific language and related execution logic implementing this approach, covering the most common transformation operations with a specific orientation towards data lacking explicit structural information, i.e. unstructured and semi-structured data. Last, but not least, we have created a normalization engine prototype that is able to perform this execution logic whilst handling the tasks that are not necessarily the responsibility of the DSL, e.g. data serialization, or timekeeping. A simple result of log data normalization, as discussed throughout this paper, is illustrated by Listing 2.

1) Prototype-Based Normalization: The designed normalization approach can be described as a series of object-to-object transformations, which is partially based on the notion of prototype-based object inheritance, sometimes also referred to as prototype-based programming. Prototype-based programming is a variant of object-oriented programming in which new objects are created by reusing attributes of existing objects, which serve as prototypes [5]. There is no notion of instantiation as in the case of class-based programming.

In our approach, every piece of data intended for normalization starts as an object with a properly defined object type it belongs to. As soon as an object is created/constructed, it is immutable, i.e. the object and its attributes cannot be further modified. The only way to achieve such a modification is to clone the existing object, referred to as the prototype, and perform a finite sequence of attribute manipulations, i.e. additions, deletions, and transformations, which will subsequently result in the construction of a new immutable object that is based on the prototype. The typed data objects that are the result of this object-to-object transformation represent the normalized event records, which can be serialized into structured event objects and exposed as data streams.

2) Domain-Specific Language: The simple domain-specific language that implements the normalization approach presented above is based on the YAML data format, and the actual compilation/execution logic is backed by the Erlang programming language. The normalization logic is described via a transformation descriptor written in the DSL, which is then compiled into a sequence of instructions that can be executed in Erlang. During the compilation, basic type-checking is performed, and explicit type information and external schemes describing all the defined object types are generated. This means that it is possible to enumerate all the event/object types that can be yielded by the normalization process before the execution.

3) Normalization Engine: We have aimed at a minimalistic design of the normalization engine, with the goal of keeping the necessary requirements to a bare minimum. The engine, written in Erlang, instantiates the input adapters as per their definition in the transformation descriptor, executes the transformation logic, and serializes the resulting event records via a data serialization format of choice. The engine is also responsible for schema generation. The normalized event objects are then written into a messaging system via an output adapter. Currently, Apache Kafka serves as the primary delivery system.
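As an illustration of the prototype-based, immutable object-to-object transformation style described above, consider this minimal Python sketch of ours; the object types, attribute names, and helper functions are hypothetical, and the real implementation is a YAML-based DSL executed in Erlang.

    from types import MappingProxyType

    def construct(type_name, **attrs):
        """Construct an immutable, typed event record."""
        return MappingProxyType({"_type": type_name, **attrs})

    def derive(prototype, type_name, *manipulations):
        """Clone the prototype and apply a finite sequence of attribute
        manipulations (additions, deletions, transformations), yielding
        a new immutable object; the prototype itself never changes."""
        attrs = dict(prototype)
        for manipulate in manipulations:
            manipulate(attrs)
        attrs["_type"] = type_name
        return MappingProxyType(attrs)

    # A raw Syslog entry enters the pipeline as a typed object ...
    raw = construct("SyslogEntry", hostname="serena", app_name="audd",
                    msg="User xtovarn logged in")

    # ... and is normalized into a UserSession event derived from it.
    session = derive(
        raw, "UserSession",
        lambda a: a.update(user=a["msg"].split()[1]),   # transformation
        lambda a: a.update(action="LOGIN"),             # addition
        lambda a: a.pop("msg"),                         # deletion
    )
    print(dict(session))
    # {'_type': 'UserSession', 'hostname': 'serena',
    #  'app_name': 'audd', 'user': 'xtovarn', 'action': 'LOGIN'}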
<137>Apr 5 19:31:10 serena audd[631]: User xtovarn logged in

UserSession() {
  syslog=SyslogInfo() {
    timestamp=1459877470000, severity=1, facility=17,
    hostname="serena", app_name="audd", procid=631
  },
  user="xtovarn",
  action="LOGIN"
}

Listing 2: Example of an unstructured Syslog log entry with a log message in natural language, and a corresponding normalized event object representing a successful user login
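For completeness, a minimal output-adapter sketch follows; it is ours, assuming the third-party kafka-python client, with an illustrative topic name and broker address, and it publishes a normalized event record to Kafka as a JSON-serialized message.

    import json
    from kafka import KafkaProducer  # assumes the kafka-python package

    # Serialize normalized event records as JSON and publish them to a
    # Kafka topic, so consumers read the result as a stream of
    # structured event objects.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    event = {"_type": "UserSession", "hostname": "serena",
             "user": "xtovarn", "action": "LOGIN"}
    producer.send("events.usersession", value=event)
    producer.flush()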
B. Evaluation Summary

In a real-world setting and in the context of online data processing, we consider throughput to be one of the most important performance metrics. We have evaluated the presented approach in terms of end-to-end throughput on real-world data sets for a workload consisting mainly of log message abstraction. The performed experiments showed that the approach is able to normalize approximately two hundred thousand unstructured log entries per second, with the normalization engine running on a single commodity server and the delivery system running on three dedicated machines. The hardware setup of the benchmarking cluster is shown in Table III.

Table III
HARDWARE SETUP FOR THE CONDUCTED EXPERIMENTS

Node type (#)        Hardware
Benchmarking (1×)    Intel® Xeon® E5410 @ 2.33GHz; 4 cores; 16GB RAM; SATA 7.2k
Normalization (1×)   Intel® Xeon® E5-2650 v2 @ 2.60GHz; 8/16 HT cores; 64GB RAM; SATA 7.2k
Messaging (3×)       2 × AMD Opteron™ 4284 @ 3.0GHz; 2 × 8 cores; 64GB RAM; SATA 7.2k

VII. CONCLUSION AND FUTURE WORK

The thesis [1] summarized in this paper represents comprehensive material dealing with one of the richest sources of behavior-related monitoring information, i.e. log data. Although the value of log data is widely recognized, so is their poor quality, which renders them unsuitable for automated processing. In this work, we have dealt with the primarily unstructured nature of log data, and especially log messages, which typically represent the most important information present in the generated log entries.

We have addressed the matter at hand by improving the quality of log data, their structure, representation, and the way they can be accessed, normalizing them into fully structured event objects with defined data schemes, which can be exposed as data streams. The results related to this thesis were published on multiple occasions [6], [7], [8], [9], [10], [11], [12], [13], [14], [15].

The achieved results offer virtually endless possibilities with respect to new approaches for log data analysis, correlation, storage, mining, pattern detection, prediction, root cause analysis, or machine learning in many application areas. In addition, thanks to the proposed concepts, it is possible to implement an architecture that allows for the ingestion and normalization of large amounts of heterogeneous monitoring data into a central location, rendering them readily available for real-time analysis, detection, alerting, and long-term retention.

We plan to reap the benefits of such unified access to high-quality event data in our future endeavours. One of our biggest ambitions in this area is the utilization of the presented results for a holistic realization of a distributed event-driven monitoring architecture for real-time security monitoring, based on information from corresponding log data producers and other important security information sources, e.g. IP flows.

ACKNOWLEDGEMENTS

The publication of this paper and the follow-up research was supported by the ERDF project "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).

REFERENCES

[1] D. Tovarnak, "Normalization of Unstructured Log Data into Streams of Structured Event Objects [online]," Dissertation thesis, Masaryk University, Faculty of Informatics, Brno, 2017. Available from <https://is.muni.cz/th/rjfzq/thesis-twoside-final-bw.pdf> [cit. 2018-10-28].
[2] M. Nagappan and M. A. Vouk, "Abstracting log lines to log event types for mining software system logs," in 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), 2010, pp. 114–117.
[3] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[4] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, "An evaluation study on log parsing and its use in log mining," in 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016, pp. 654–661.
[5] M. Abadi and L. Cardelli, A Theory of Objects, 1st ed. Springer-Verlag New York, Inc., 1996.
[6] D. Tovarnak and T. Pitner, "Towards Multi-tenant and Interoperable Monitoring of Virtual Machines in Cloud," in Proceedings of the 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, ser. SYNASC '12. IEEE Computer Society, 2012, pp. 436–442. [Online]. Available: http://dx.doi.org/10.1109/SYNASC.2012.55
[7] D. Tovarnak, A. Vasekova, S. Novak, and T. Pitner, "Structured and Interoperable Logging for the Cloud Computing Era: The Pitfalls and Benefits," in Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing, ser. UCC '13, 2013, pp. 91–98.
[8] D. Tovarnak, "Towards Distributed Event-driven Monitoring Architecture [online]," Ph.D. thesis proposal, Masaryk University, Faculty of Informatics, Brno, 2013. Available from <http://theses.cz/id/0jawn5/?lang=en> [cit. 2017-02-02].
[9] D. Tovarnak, F. Nguyen, and T. Pitner, "Distributed Event-Driven Model for Intelligent Monitoring of Cloud Datacenters," in Proceedings of the 7th International Symposium on Intelligent Distributed Computing, ser. IDC '13. Springer International Publishing, 2014, pp. 87–92. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-01571-2_11
[10] F. Nguyen, D. Tovarnak, and T. Pitner, "Semantically Partitioned Peer to Peer Complex Event Processing," in Proceedings of the 7th International Symposium on Intelligent Distributed Computing, ser. IDC '13. Springer International Publishing, 2014, pp. 55–65. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-01571-2_8
[11] D. Tovarnak and T. Pitner, "Continuous Queries Over Distributed Streams of Heterogeneous Monitoring Data in Cloud Datacenters," in Proceedings of the 9th International Joint Conference on Software Technologies - Volume 1: ICSOFT-EA, ser. ICSOFT '14, INSTICC. SciTePress, 2014, pp. 470–481.
[12] D. Tovarnak, "Practical Multi-Pattern Matching Approach for Fast and Scalable Log Abstraction," in Proceedings of the 11th International Joint Conference on Software Technologies - Volume 1: ICSOFT-EA, ser. ICSOFT '16, INSTICC. SciTePress, 2016, pp. 319–329.
[13] M. Cermak, D. Tovarnak, M. Lastovicka, and P. Celeda, "A Performance Benchmark for NetFlow Data Analysis on Distributed Stream Processing Systems," in Proceedings of the 2016 IEEE/IFIP Network Operations and Management Symposium, ser. NOMS '16, 2016, pp. 919–924.
[14] T. Jirsik, M. Cermak, D. Tovarnak, and P. Celeda, "Toward Stream-Based IP Flow Analysis," IEEE Communications Magazine, vol. 55, no. 7, pp. 70–76, 2017.
[15] J. Vykopal, R. Oslejsek, P. Celeda, M. Vizvary, and D. Tovarnak, "KYPO Cyber Range: Design and Use Cases," in Proceedings of the 12th International Conference on Software Technologies - Volume 1: ICSOFT, ser. ICSOFT '17, INSTICC. SciTePress, 2017, pp. 310–321.