Log Data Normalization for Event Streams
Abstract—Monitoring plays a crucial role in the operation of any sizeable distributed IT infrastructure. Whether it is a university network or a cloud datacenter, monitoring information is continuously used in a wide spectrum of ways, ranging from mission-critical jobs, e.g. accounting or incident handling, to equally important development-related tasks, e.g. debugging or fault detection. Whilst pursuing a novel vision of new-generation event-driven monitoring systems, we have identified that a particularly rich source of monitoring information, computer logs, is also one of the most problematic in terms of automated processing. Log data are predominantly generated in an ad-hoc manner using a variety of incompatible formats, with the most important pieces of information, i.e. log messages, in the form of unstructured strings. This clashes with our long-term goal of designing a system enabling its users to transparently define real-time continuous queries over homogeneous streams of properly defined monitoring event objects with explicitly described structure. Our goal is to bridge this gap by normalizing the poorly structured log data into streams of structured event objects. The combined challenge of this goal is structuring the log data whilst considering the high velocity with which they are generated in modern IT infrastructures. This paper summarizes the contributions of the dissertation thesis "Normalization of Unstructured Log Data into Streams of Structured Event Objects", which deals with the matter at hand in detail.

Index Terms—log management, logging, data integration, normalization, stream processing, monitoring

I. INTRODUCTION

Computer logs are one of the few mechanisms available for gaining visibility into the behavior of an IT infrastructure and its elements. They are also considered to be one of the richest and most valuable sources of such behavior-related monitoring information. However, log data are repeatedly reported to be of poor quality, mainly because a considerable portion of logs is unstructured by nature. This renders them unsuitable for straightforward automated processing and analysis. In many cases, even semi-structured log data can be considered suboptimal for direct processing, i.e. when being processed by systems that expect some kind of schema to be imposed on the processed data. During the operation of any modern IT infrastructure, vast floods of heterogeneous log data are generated by many distributed producers spread across the infrastructure's layers and tiers.

These facts directly clash with a vision of a new-generation event-driven monitoring system enabling its users to transparently define real-time continuous queries over homogeneous streams of properly defined monitoring events. The continuous queries would be used to detect complex events, for example, one thousand unsuccessful logins of user root in 5 minutes, representing patterns of simpler events present in the monitoring information, e.g. user login in this case. Bluntly put, this amounts to a holistic application of the Complex Event Processing approach to the monitoring and log analysis domain.

II. CONTEXT AND PROBLEM STATEMENT

In the context of our work, an ideal state of affairs would be if all the log data generated by a given IT infrastructure were accessible in an interoperable and scalable manner as streams of structured event objects. A structured event object is a serialized piece of data, representing an occurrence, which is described via an explicit and strict data schema. An event stream is an infinite sequence of such objects adhering to the same schema. This would allow all of the log data to be accessed in a transparent and unified way within the notion of a loosely coupled event-driven architecture. As a result, the log data consumers would not only be able to directly utilize the Complex Event Processing approach for advanced correlation and monitoring queries, but it would also be possible to research novel monitoring approaches, e.g. based on machine learning, pattern mining, or predictive modelling. All of this over high-quality source data.

In its current state, log data are continuously generated at high rates by many distributed producers using several transport protocols and many heterogeneous representations. Moreover, a predominant portion of log entries takes the form of unstructured or semi-structured data, with the main piece of information, i.e. the log message, represented as a free-form string mixing natural language with run-time context variables.

We propose to close this gap by means of data normalization, i.e. transformation and unification of data transport, data representation, data types, and data structures, resulting in a common format. Normalization is a recognized data integration pattern in the context of message-driven and, in turn, event-driven architectures. The presented dissertation thesis [1] deals with multiple knowledge gaps in areas inherent to the normalization of heterogeneous low-grade log data into streams of properly defined event objects.
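For illustration, the notion of a structured event object with an explicit and strict schema can be sketched as follows. This is a minimal Python sketch of ours, not part of the thesis; the event type and field names are hypothetical and merely mirror Listing 2 later in this paper.

    # Illustrative schema for a hypothetical "UserSession" event type.
    SCHEMA = {
        "timestamp": int,   # milliseconds since the Unix epoch
        "hostname":  str,
        "user":      str,
        "action":    str,   # e.g. "LOGIN"
    }

    def validate(event: dict) -> bool:
        """An event object is valid only if it carries exactly the
        declared fields, each of the declared type, i.e. an explicit
        and strict data schema."""
        return event.keys() == SCHEMA.keys() and all(
            isinstance(event[k], t) for k, t in SCHEMA.items()
        )

    # An event stream is then an (infinite) sequence of such objects,
    # all adhering to the same schema.
    event = {"timestamp": 1459877470000, "hostname": "serena",
             "user": "xtovarn", "action": "LOGIN"}
    assert validate(event)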
Table I
EXAMPLE OF MESSAGE TYPE DISCOVERY IN THE TASK OF LOG ABSTRACTION

IV. LOG ABSTRACTION – MESSAGE TYPE DISCOVERY

The discovery of message types for the purposes of log abstraction is a tiresome process when done entirely manually. Log data generated by a single application or software project can contain hundreds of unique message types, since the corresponding logging code can contain hundreds of unique logging statements. For this reason, the research in this respect is focused on automated approaches for message type discovery. In the literature, two orthogonal groups of approaches can be identified – the message type discovery can be based either on source code analysis, or on data mining techniques.
Since the source code of the targeted log data producer may not always be available for analysis, which is true especially for proprietary hardware and software appliances, we have turned our attention towards approaches that discover message types from historical log data via data mining techniques. A number of works emerged in this area over the years, utilizing different approaches primarily based on clustering and other data mining techniques.

We refer to our discovery algorithm as the Extended Nagappan-Vouk (ENG) algorithm, since it is based on the original idea of a frequency table and intra-message word frequency proposed in [2]. Beyond that, the algorithm is significantly improved in order to support multiple delimiters for tokenization, support multi-word variables, and report distinct message types with no overlapping; finally, the algorithm can be parameterized via a single parameter controlling its sensitivity and the granularity of the reported message types. The discovery algorithm is able to generate pattern sets in a special format directly suitable for log abstraction via pattern matching, as can be seen in Listing 1.

regexes: # regex tokens
  INT:  [integer, "[0-9]+"]
  BOOL: [boolean, "\btrue\b|\bfalse\b"]
  WORD: [string, "[0-9a-zA-Z]+"]

patterns: # patterns describing the message types
  grp0:
    mt1: 'User %{WORD:var1} logged %{WORD:var2}'
    mt2: 'Service %{WORD:var1} started'

Listing 1: Example of a generated pattern set combining regex tokens with patterns describing the message types
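To convey the underlying principle, the following Python sketch implements a bare-bones frequency-table discovery in the spirit of Nagappan and Vouk [2]. It is our simplified illustration, not the actual ENG algorithm (multi-delimiter tokenization, multi-word variables, and overlap elimination are omitted), and the sample log lines are made up.

    from collections import Counter

    def discover(messages, sensitivity=1.0):
        """Tokens that occur frequently at a given position are treated
        as constants of a message type; infrequent tokens are abstracted
        into variables.  A single sensitivity parameter controls the
        granularity, loosely mirroring the ENG parameterization."""
        freq = Counter()                        # (position, token) -> count
        tokenized = [m.split() for m in messages]
        for tokens in tokenized:
            freq.update(enumerate(tokens))

        types = set()
        for tokens in tokenized:
            counts = [freq[(i, t)] for i, t in enumerate(tokens)]
            threshold = max(counts) * sensitivity
            types.add(tuple(
                t if freq[(i, t)] >= threshold else "%{WORD}"
                for i, t in enumerate(tokens)
            ))
        return types

    logs = ["User alice logged in", "User bob logged out",
            "Service sshd started", "Service crond started"]
    for t in sorted(discover(logs)):
        print(" ".join(t))
    # -> Service %{WORD} started
    #    User %{WORD} logged %{WORD}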
For evaluation, we have reused the methodology and data sets of an existing evaluation study [4], which compared the accuracy of four different message type discovery algorithms (SLCT, IPLoM, LKE, and logSig). The data sets were randomly sampled to 2000 log messages each in order to shorten the running times of some of the more computationally intensive algorithms. The ground truth (gold standard) was created manually by the authors. The reported results (F-measures) of the evaluated algorithms, as well as the results of our algorithm, are summarized in Table II.

It can be seen that the proposed algorithm exhibits superior accuracy in an evaluation based on five real-world data sets with externally provided ground truth. When using its default settings (ENG), the algorithm achieved very high accuracy with an average F-measure of 0.953. When considering the best algorithm settings for each data set (ENG*), it exhibited an average F-measure of 0.996.

Table II
F-MEASURES FOR EVALUATED ALGORITHMS

V. LOG ABSTRACTION – PATTERN MATCHING

A discovered pattern set is only useful if incoming log entries can be matched against it at high speed: how can each log entry be efficiently matched against all of the matching patterns created for this purpose? Tree-based approaches address the problem of multi-pattern matching in a more straightforward way than naive sequential multi-regex matching, by organizing the matching patterns in various tree-like structures with the goal of segmenting and limiting the searched pattern-space. This tree-like organization can be either inter-pattern, i.e. organizing the individual patterns as a whole with respect to some observed knowledge, or intra-pattern, i.e. organizing the individual pattern components, words for example, into a tree-like matching structure.

We have designed an elegant multi-pattern matching algorithm based on a clever intra-pattern organization that is able to practically eliminate the need for multi-regex matching, whilst imposing only minimal limitations on the way the matching patterns can be created. The basic idea of our approach is based on organizing the pattern set into a special data structure we refer to as a regex trie (REtrie), as seen in Figure 1.
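As an illustration of the intra-pattern idea, the following minimal Python sketch organizes the words and regex tokens of the patterns from Listing 1 into a token trie, so that a log message is matched by a single walk over the trie instead of being tried against every pattern's regex. This is our simplified illustration of the general principle, not the actual REtrie algorithm, and the helper names are ours.

    import re

    TOKENS = {"WORD": re.compile(r"[0-9a-zA-Z]+"),
              "INT":  re.compile(r"[0-9]+")}

    class Node:
        def __init__(self):
            self.children = {}   # token (literal word or %{...}) -> Node
            self.mtype = None    # message type reported at this leaf

    def insert(root, pattern, mtype):
        node = root
        for tok in pattern.split():
            node = node.children.setdefault(tok, Node())
        node.mtype = mtype

    def match(node, words):
        if not words:
            return node.mtype
        head, rest = words[0], words[1:]
        if head in node.children:                    # literal words first
            found = match(node.children[head], rest)
            if found:
                return found
        for tok, child in node.children.items():     # then regex tokens
            name = tok[2:-1].split(":")[0] if tok.startswith("%{") else None
            if name and TOKENS[name].fullmatch(head):
                found = match(child, rest)
                if found:
                    return found
        return None

    root = Node()
    insert(root, "User %{WORD:var1} logged %{WORD:var2}", "mt1")
    insert(root, "Service %{WORD:var1} started", "mt2")
    print(match(root, "User xtovarn logged in".split()))   # -> mt1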
VI. END-TO-END LOG DATA NORMALIZATION

From the data integration perspective, normalization can be performed on four different translation layers. The transport layer determines the way the data are transferred over the network. The second layer determines the data representation, i.e. how the data are serialized into individual elements, consequently determining whether they are unstructured, semi-structured, or structured. The data types layer is extremely important since it defines the data types on which the domain model is based. The fourth, the data structures layer, describes a top-level domain model, i.e. what logical entities will be dealt with and what relationships they will have, if any. In terms of data integration, the most loosely coupled outcome of normalization takes the form of a Canonical Data Model, i.e. a common data format unifying the three top layers – the bottom layer is assumed to be based on messaging. In our case, the Canonical Data Model is represented by structured event objects and their individual types, whose data schemes are explicitly defined.

In the course of the normalization process, the parsing of different formats of log entries and the abstraction of log messages is only one of many different tasks that are actually needed to transform the log data into the desired state. Other common tasks that are somewhat inherent to the log data normalization process include: input and output adaptation, data serialization and deserialization, parsing, transformation, and enrichment.

Whilst some of the already existing log management tools are quite capable and support many of these tasks in one form or another, they have very limited capabilities in terms of structuring the log data into event objects, which mainly stems from their orientation towards basic semi-structured data manipulations and the predominantly untyped nature of their corresponding domain-specific languages. Although we have been able to implement end-to-end log data normalization logic by using these tools, it was always at the cost of manual type enforcement, ex-post schema definition, and combination with additional external functionality, which was rather error-prone.

A. Our Approach

To address the problems pointed out above, we have first designed an abstract log data normalization approach that allows for the data transformations to be carried out in a statically typed object-oriented paradigm, instead of being oriented towards dynamically typed or untyped transformations of associative arrays, as is common in practice. Then, we have created a domain-specific language and related execution logic implementing this approach, covering the most common transformation operations with a specific orientation towards data lacking explicit structural information, i.e. unstructured and semi-structured data. Last, but not least, we have created a normalization engine prototype that is able to perform this execution logic whilst handling the tasks that are not necessarily the responsibility of the DSL, e.g. data serialization, or timekeeping. A simple result of log data normalization, as discussed throughout this paper, is illustrated by Listing 2.

1) Prototype-Based Normalization: The designed normalization approach can be described as a series of object-to-object transformations, which is partially based on the notion of prototype-based object inheritance, sometimes also referred to as prototype-based programming. Prototype-based programming is a variant of object-oriented programming in which new objects are created by reusing attributes of existing objects, which serve as prototypes [5]. There is no notion of instantiation as in the case of class-based programming.

In our approach, every piece of data intended for normalization starts as an object with a properly defined object type it belongs to. As soon as an object is created/constructed, it is immutable, i.e. the object and its attributes cannot be further modified. The only way to achieve such a modification is to clone the existing object, referred to as the prototype, and perform a finite sequence of attribute manipulations, i.e. additions, deletions, and transformations, which will subsequently result in the construction of a new immutable object that is based on the prototype. The typed data objects that are the result of this object-to-object transformation represent the normalized event records, which can be serialized into structured event objects and exposed as data streams.

2) Domain-Specific Language: The simple domain-specific language that implements the normalization approach presented above is based on the YAML data format, and the actual compilation/execution logic is backed by the Erlang programming language. The normalization logic is described via a transformation descriptor written in the DSL, which is then compiled into a sequence of instructions that can be executed in Erlang. During the compilation, basic type-checking is performed, and explicit type information and external schemes describing all the defined object types are generated. This means that it is possible to enumerate all the event/object types that can be yielded by the normalization process before the execution.

3) Normalization Engine: We have aimed at a minimalistic design of the normalization engine, with the goal of keeping the necessary requirements to a bare minimum. The engine, written in Erlang, instantiates the input adapters as per their definition in the transformation descriptor, executes the transformation logic, and serializes the resulting event records via a data serialization format of choice. The engine is also responsible for schema generation. The normalized event objects are then written into a messaging system via an output adapter. Currently, Apache Kafka serves as the primary delivery system.
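As an illustration of the prototype-based, immutable object-to-object transformation style described above, consider this minimal Python sketch of ours; the object types, attribute names, and helper functions are hypothetical, and the real implementation is a YAML-based DSL executed in Erlang.

    from types import MappingProxyType

    def construct(type_name, **attrs):
        """Construct an immutable, typed event record."""
        return MappingProxyType({"_type": type_name, **attrs})

    def derive(prototype, type_name, *manipulations):
        """Clone the prototype and apply a finite sequence of attribute
        manipulations (additions, deletions, transformations), yielding
        a new immutable object; the prototype itself never changes."""
        attrs = dict(prototype)
        for manipulate in manipulations:
            manipulate(attrs)
        attrs["_type"] = type_name
        return MappingProxyType(attrs)

    # A raw Syslog entry enters the pipeline as a typed object ...
    raw = construct("SyslogEntry", hostname="serena", app_name="audd",
                    msg="User xtovarn logged in")

    # ... and is normalized into a UserSession event derived from it.
    session = derive(
        raw, "UserSession",
        lambda a: a.update(user=a["msg"].split()[1]),   # transformation
        lambda a: a.update(action="LOGIN"),             # addition
        lambda a: a.pop("msg"),                         # deletion
    )
    print(dict(session))
    # {'_type': 'UserSession', 'hostname': 'serena',
    #  'app_name': 'audd', 'user': 'xtovarn', 'action': 'LOGIN'}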
<137>Apr 5 19:31:10 serena audd[631]: User xtovarn logged in

UserSession() {
  syslog=SyslogInfo() {
    timestamp=1459877470000, severity=1, facility=17,
    hostname="serena", app_name="audd", procid=631
  },
  user="xtovarn",
  action="LOGIN"
}

Listing 2: Example of an unstructured Syslog log entry with a log message in natural language, and a corresponding normalized event object representing a successful user login
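For completeness, a minimal output-adapter sketch follows; it is ours, assuming the third-party kafka-python client, with an illustrative topic name and broker address, and it publishes a normalized event record to Kafka as a JSON-serialized message.

    import json
    from kafka import KafkaProducer  # assumes the kafka-python package

    # Serialize normalized event records as JSON and publish them to a
    # Kafka topic, so consumers read the result as a stream of
    # structured event objects.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    event = {"_type": "UserSession", "hostname": "serena",
             "user": "xtovarn", "action": "LOGIN"}
    producer.send("events.usersession", value=event)
    producer.flush()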
B. Evaluation Summary

In a real-world setting and in the context of online data processing, we consider throughput to be one of the most important performance metrics. We have evaluated the presented approach in terms of end-to-end throughput on real-world data sets for a workload consisting mainly of log message abstraction. The performed experiments showed that the approach is able to normalize approximately two hundred thousand unstructured log entries per second, with the normalization engine running on a single commodity server and the delivery system running on three dedicated machines. The hardware setup of the benchmarking cluster is shown in Table III.

Table III
HARDWARE SETUP FOR THE CONDUCTED EXPERIMENTS

Node type (#)        Hardware
Benchmarking (1×)    Intel® Xeon® E5410 @ 2.33GHz; 4 cores; 16GB RAM; SATA 7.2k
Normalization (1×)   Intel® Xeon® E5-2650 v2 @ 2.60GHz; 8/16 HT cores; 64GB RAM; SATA 7.2k
Messaging (3×)       2 × AMD Opteron™ 4284 @ 3.0GHz; 2 × 8 cores; 64GB RAM; SATA 7.2k

VII. CONCLUSION AND FUTURE WORK

The thesis [1] summarized in this paper represents comprehensive material dealing with one of the richest sources of behavior-related monitoring information, i.e. log data. Although the value of log data is widely recognized, so is their poor quality, which renders them unsuitable for automated processing. In this work, we have dealt with the primarily unstructured nature of log data, and especially log messages, which typically represent the most important information present in the generated log entries.

We have addressed the matter at hand by improving the quality of log data, their structure, representation, and the way they can be accessed, normalizing them into fully structured event objects with defined data schemes, which can be exposed as data streams. The results related to this thesis were published on multiple occasions [6], [7], [8], [9], [10], [11], [12], [13], [14], [15].

The achieved results offer virtually endless possibilities with respect to new approaches for log data analysis, correlation, storage, mining, pattern detection, prediction, root cause analysis, or machine learning in many application areas. In addition, thanks to the proposed concepts, it is possible to implement an architecture that allows for the ingestion and normalization of large amounts of heterogeneous monitoring data into a central location, rendering them readily available for real-time analysis, detection, alerting, and long-term retention.

We plan to reap the benefits of such unified access to high-quality event data in our future endeavours. One of our biggest ambitions in this area is the utilization of the presented results for a holistic realization of a distributed event-driven monitoring architecture for real-time security monitoring, based on information from corresponding log data producers and other important security information sources, e.g. IP flows.

ACKNOWLEDGEMENTS

The publication of this paper and the follow-up research was supported by the ERDF project "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).

REFERENCES

[1] D. Tovarnak, "Normalization of Unstructured Log Data into Streams of Structured Event Objects [online]," Dissertation thesis, Masaryk University, Faculty of Informatics, Brno, 2017. Available from <https://is.muni.cz/th/rjfzq/thesis-twoside-final-bw.pdf> [cit. 2018-10-28].
[2] M. Nagappan and M. A. Vouk, "Abstracting log lines to log event types for mining software system logs," in 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), 2010, pp. 114–117.
[3] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[4] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, "An evaluation study on log parsing and its use in log mining," in 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016, pp. 654–661.
[5] M. Abadi and L. Cardelli, A Theory of Objects, 1st ed. Springer-Verlag New York, Inc., 1996.
[6] D. Tovarnak and T. Pitner, "Towards Multi-tenant and Interoperable Monitoring of Virtual Machines in Cloud," in Proceedings of the 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, ser. SYNASC '12. IEEE Computer Society, 2012, pp. 436–442. [Online]. Available: http://dx.doi.org/10.1109/SYNASC.2012.55
[7] D. Tovarnak, A. Vasekova, S. Novak, and T. Pitner, "Structured and Interoperable Logging for the Cloud Computing Era: The Pitfalls and Benefits," in Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing, ser. UCC '13, 2013, pp. 91–98.
[8] D. Tovarnak, "Towards Distributed Event-driven Monitoring Architecture [online]," Ph.D. thesis proposal, Masaryk University, Faculty of Informatics, Brno, 2013. Available from <http://theses.cz/id/0jawn5/?lang=en> [cit. 2017-02-02].
[9] D. Tovarnak, F. Nguyen, and T. Pitner, "Distributed Event-Driven Model for Intelligent Monitoring of Cloud Datacenters," in Proceedings of the 7th International Symposium on Intelligent Distributed Computing, ser. IDC '13. Springer International Publishing, 2014, pp. 87–92. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-01571-2_11
[10] F. Nguyen, D. Tovarnak, and T. Pitner, "Semantically Partitioned Peer to Peer Complex Event Processing," in Proceedings of the 7th International Symposium on Intelligent Distributed Computing, ser. IDC '13. Springer International Publishing, 2014, pp. 55–65. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-01571-2_8
[11] D. Tovarnak and T. Pitner, "Continuous Queries Over Distributed Streams of Heterogeneous Monitoring Data in Cloud Datacenters," in Proceedings of the 9th International Joint Conference on Software Technologies - Volume 1: ICSOFT-EA, ser. ICSOFT '14, INSTICC. SciTePress, 2014, pp. 470–481.
[12] D. Tovarnak, "Practical Multi-Pattern Matching Approach for Fast and Scalable Log Abstraction," in Proceedings of the 11th International Joint Conference on Software Technologies - Volume 1: ICSOFT-EA, ser. ICSOFT '16, INSTICC. SciTePress, 2016, pp. 319–329.
[13] M. Cermak, D. Tovarnak, M. Lastovicka, and P. Celeda, "A Performance Benchmark for NetFlow Data Analysis on Distributed Stream Processing Systems," in Proceedings of the 2016 IEEE/IFIP Network Operations and Management Symposium, ser. NOMS '16, 2016, pp. 919–924.
[14] T. Jirsik, M. Cermak, D. Tovarnak, and P. Celeda, "Toward Stream-Based IP Flow Analysis," IEEE Communications Magazine, vol. 55, no. 7, pp. 70–76, 2017.
[15] J. Vykopal, R. Oslejsek, P. Celeda, M. Vizvary, and D. Tovarnak, "KYPO Cyber Range: Design and Use Cases," in Proceedings of the 12th International Conference on Software Technologies - Volume 1: ICSOFT, ser. ICSOFT '17, INSTICC. SciTePress, 2017, pp. 310–321.