Anomaly_Detection_and_Enterprise_Security_using_User_and_Entity

The document presents a framework for User and Entity Behavior Analytics (UEBA) aimed at detecting anomalies in user behavior to enhance enterprise security. It highlights the increasing threat of insider attacks and the need for effective detection methodologies, utilizing machine learning and data analysis techniques. The paper discusses the challenges of processing large datasets and aims to provide solutions for real-time monitoring and anomaly identification to mitigate security risks.


2022 3rd International Conference on Innovations in Computer Science & Software Engineering (ICONICS)

Anomaly Detection and Enterprise Security using User and Entity Behavior Analytics (UEBA)

Muhammad Zunair Ahmed Khan
Department of Computer Science & IT
NED University of Engineering & Technology
Karachi, Pakistan
[email protected]

Muhammad Mubashir Khan
Department of Computer Science & IT
NED University of Engineering & Technology
Karachi, Pakistan
[email protected]

Junaid Arshad
School of Computing & Digital Technology
Birmingham City University
Birmingham, UK
[email protected]

Abstract—Digital fraud is made possible by a lack of transparency and other security flaws in a system, and it has consequently grown into one of the most pervasive problems in the world. When such fraud originates from within a business, it is referred to as an insider threat and may lead to severe consequences. Various frameworks have been proposed to lessen this problem; however, transparency remains a challenge. Conventionally, storing data in chronological order to prevent data manipulation is one technique to ensure traceability and security. In this paper, we present a framework based on the User and Entity Behavior Analytics (UEBA) approach to study user profiles over time and classify them as normal or aberrant. The proposed framework utilises additional information including IP addresses, location data, and the users' organizations. We focus on applying data science and analytical methods to create data visualizations for analysis and anomaly identification.
Index Terms—anomaly detection, UEBA, enterprise security, user behaviour, entity analytics

I. INTRODUCTION
User and Entity Behavior Analytics (UEBA) is one of the most intriguing business frontiers of the twenty-first century. It is classified as a risk management system and employs machine learning, algorithms, and statistical analysis to detect network threats in real time. These capabilities help manage an organization's user security. Furthermore, continuous monitoring with real-time access reduces unauthorized access and abnormal user behaviour, whether intentional or accidental, allowing for a secure environment. However, the considerable amount of data generated by each user is an issue, as processing data from hundreds of users is compute-intensive. This poses a problem for timely and accurate detection and makes it challenging to scale in an enterprise environment.

A. Problem Statement
Unauthorized access, use, and modification of personal information pose a serious security risk. The problem is characterized by the increasing complexity of business, which requires user authority and credentials, access rights, payment cards, and other security credentials, and which makes compliance more challenging. Information breaches can cost organisations both money and reputation. A Ponemon Institute study puts the average cost of employee negligence at around $310,000 and the cost of a malicious insider at $760,000, whereas credential theft can be the most expensive at $870,000 [1]. Therefore, developing a reliable UEBA framework will significantly contribute to the security of sensitive systems.

B. Research Questions
1) How to account for anomaly detection without ground truth data?
2) How to devise a methodology for handling huge amount of data?
3) How to suppress false positive in large scale enterprise environment?

C. Contribution
This research aims to contribute to devising a robust methodology for the detection of anomalous behaviour. The work presented shows promising results in profiling a large number of users. The detection methodology enables company management to process user-specific data to detect any insider threat.
This paper is structured as follows: Section II gives an overview of background knowledge. Section III presents a literature review of the existing research. Section IV explains the features of the Carnegie Mellon University (CMU) dataset. Section V discusses the proposed system and the techniques involved. Section VI evaluates the results. The interactive analysis interface is illustrated in Section VII. Section VIII concludes the research, and Section IX highlights the future direction of this research.

978-1-6654-6141-2/22/$31.00 ©2022 IEEE

Authorized licensed use limited to: Astana IT University. Downloaded on January 22,2025 at 11:42:04 UTC from IEEE Xplore. Restrictions apply.
II. BACKGROUND
Insider threats pose a huge risk to all critical infrastructure sectors. This section discusses the distinguishing characteristics, actions, and motives of insider threats.

A. Insider Threats
An insider is any individual who has or had permitted, authorized, or privileged access to or knowledge of an organization's resources, data, assets, etc. The number of insider threats is increasing, which emphasizes the importance of further research. An insider may be one or all of the following:
• A trusted individual
• Employees and those to whom the organization has given sensitive information and access
• Employees that have computer and/or network access
• Someone who designs products or is responsible for the services an organization provides
• Individuals that have knowledge of an organization's strengths, weaknesses, the opportunities it is seeking, or its threats (competitors), etc.
1) General Characteristics of Insiders: Many earlier studies have shown that employees who tend to participate in an insider attack exhibit predictable personality characteristics and behavioral traits [2]–[4]. According to a 2012 research study [5], organizations that can identify these traits will be able to develop additional protection protocols to decrease the possibility of an insider attack. Different case studies targeted the IT departments of numerous organizations to monitor the behavioral, psychological, and social characteristics of insiders [2], [6]–[8]. The key points observed across these studies are:
• Age, ethnic background, and race are not key identifiers for an insider. People belonging to different age groups, ethnic backgrounds, and races were involved in insider attacks.
• The majority of the insiders have criminal records (i.e. they have previous arrest records).
• Both ex-employees and current employees are found to be involved in insider attacks. Employees who are frustrated with their organization or its policies are the most likely to be insiders. Employees who have served an organization for less than 5 years are reported to have a 60% chance of acting as an insider.
• Around 88% of permanent employees are involved in insider-related attacks.

B. Categorization of Insider Threat
This section covers the three main categories of insider threats.
1) Compromised Insider: An organizational member whose account credentials have been breached or whose system is compromised may enable an attacker to acquire unrestricted access to sensitive or private company systems or assets.
2) Negligent Insider: An organization is put at risk by a careless insider's negligence. Negligent insiders are aware of security policies, but they choose to disregard them, putting the firm in danger. Human error plays the primary role in causing significant damage to an organization's assets when it comes to unintentional insider threats [9].
3) Malicious Insider: Malicious insiders deliberately harm a company and are typically described as purposeful insiders. Their motivation is either personal or financial gain, or harming the organization.

C. User & Entity Based Behavior Analysis (UEBA)
User and entity behavior analytics (UEBA) is a cybersecurity solution that employs machine learning (ML) to detect anomalies in the behavior of corporate network routers, servers, and endpoints [10]. It seeks to identify any unusual or suspicious behavior, that is, instances in which there have been irregularities in daily patterns or usage. For example, if an employee on the company network regularly downloads 10 MB of files every day but suddenly begins downloading 5 GB of files, the UEBA system will detect an anomaly and either alert an IT administrator or automatically disconnect that employee from the network. UEBA does more than just monitor human behavior; it also monitors machines. For example, a company server in one branch office may receive more requests than usual, indicating the beginning of a potential attack [11]. IT administrators may fail to notice this sort of behavior, but UEBA will recognize it and take appropriate action.
UEBA is a more exhaustive version of UBA because it includes entities such as routers, servers, and endpoints or devices. In October 2017, Gartner [12] added the extra "E" to UBA to help the security industry understand that entities should be considered along with users to identify persistent insider-related threats in an organization, because user and entity activity are correlated.

III. RELATED WORK
With the rising frequency and impact of data breaches, it has become essential for companies to automate intrusion detection systems (IDS) through machine learning-based solutions. This usually brings difficulties such as class imbalance, changing target concepts, and the difficulty of conducting a sound evaluation. In [13], the researchers adopted a user-centred anomaly detection approach to address selected challenges of intrusion detection through real-world use cases in identity and access management (IAM). In 2022, researchers [14] presented an insider threat detection model, ITDBLA, based on LSTM-Attention. They extracted user and role behavioral features, user behavioral sequences, and psychological data from various heterogeneous log files to determine the everyday behaviors of the users.
Data exploitation by insiders is widely identified as a common vector for cyberattacks. Recent research covers this area in technological, psychological, and sociotechnical contexts. The 2022 research

[15] specifically analyzed unintentional insider threats and documented the results of a series of detailed interviews, guided by the Critical Decision Method (CDM), with people who had encountered several types of unwitting security breaches. This work also focused on the factors that primarily contribute to completing day-to-day tasks.
Phishing is the social engineering attack used by hackers to steal a user's credentials. In research [16] on this technique and on email scams, the researchers developed an IDS Chrome extension that identifies phishing in real time by analyzing the URL, domain, content, and page attributes of a URL present in an email or any part of a web page. They devised a lightweight and proactive rule-based incremental approach to identify previously unseen phishing URLs. This framework is capable of detecting zero-day and spear phishing attacks efficiently.
The paper [17] suggested a new multilayer framework for insider threat detection. The upper layer of this framework selects the most suitable insider threat detection classification model among several candidates using multi-criteria decision-making techniques. The selection process integrates the entropy and VIKOR techniques. The lower layer uses the random forest algorithm to create the Misuse Insider Threat Detection (MITD) model, yielding a hybrid insider threat detection method.
The research in [18] proposed a novel Cryptography and Machine learning-based Authentication Protocol (CMAP) to create a secure data exchange environment for federated cloud server users. It is essentially an online threat detector deployed at a cloud server, using an ensemble Voting Classifier as its baseline, for mitigating DoS attacks and other security breaches. The proposed protocol is analyzed against numerous attacks such as credential (ID & password) leakage, session key computation, user anonymity, insider, middleman, client impersonation, replay, third-party impersonation, and forward secrecy attacks.
Although it had been in use for some time, Gartner's Security and Risk Management Summit recognized UEBA as a risk management solution in 2016 [19]. Despite increased corporate interest in user behavior analysis, security practitioners remain skeptical, with machine learning being deployed in only a few real-world implementations [20]–[23]. There are heuristic methods for finding a subset of "pure" points in a dataset and removing outliers. They iteratively search for the points with the minimum determinant of their covariance matrix [24], [25].
The authors of [11] illustrated how and why machine learning technology may be used to correct mistakes and enable a vital new security component. However, people are still required to investigate incidents, determine whether they are damaging, and provide further forensic information. According to [11], machine learning has the ability, and the future potential, to recognize, analyze, and respond to insider threats. The authors of [26] used additional constraints on policy structure as variables. According to [27], IP addresses can be used to track user activity and location. The research study [28] focuses on providing a cyber security culture framework that primarily considers human factors to detect both types of potential insider threats (malicious and unintentional). An intelligent and efficient anomaly detection and recognition system is proposed in [29]. A resource-constrained device is used to detect anomalies, and a cloud-based two-stream neural network is used for detailed anomaly analysis. That work presents an efficient and robust framework to recognize anomalies in surveillance Big Video Data (BVD) using the Artificial Intelligence of Things (AIoT). Smart surveillance is an important application of AIoT, for which the authors of [29] propose a two-stream neural network. The paper [30] proposed a customized, unified anomaly identification framework for network traffic anomaly detection, in which data is collected on the premise of security protection and partially customized models are built by calibration.
A research study proposed a deep learning (DL)-based anomaly detection framework composed of estimation and classification models, applied to a subdomain of healthcare systems referred to as the Diabetes Management Control System (DMCS) [31]. The estimation model was used to predict the glucose level of patients at every time step, while the classification model is designed to detect anomalous data points. In addition, considering that the dataset contains sensitive physiological data of the patients, that paper implements independent learning (IL) and federated learning (FL) strategies to preserve user data privacy. Based on the experimental results, the FL strategy showed a higher recall rate (98.69%) than the IL strategy (97.87%). Furthermore, the FL-supported CNN-based anomaly detection framework performs better than the MLP-based approach. Insider threat detection is a difficult security problem for organizations. Existing methods to detect insider threats rely on psycho-physiological factors, statistical analysis, machine learning, and deep learning techniques. They depend on predefined rules or stored signatures and fail to detect new or unknown attacks. To overcome some of the limitations of the existing methods, a recent work proposed a behavior-based insider threat detection technique [32]. The proposed technique is evaluated on the CMU-CERT insider threat dataset. It is reported to outperform existing techniques on the following metrics: accuracy, precision, recall, F-measure, and AUC-ROC. The insider threat detection results show a significant improvement over existing techniques.
Security analysts require advanced tools that allow them to investigate and detect user activity that could be indicative of an impending threat to the organization. In [33], researchers discussed the challenges associated with detecting insider threat activity, along with the tools that can help combat this issue. They demonstrated their approach using the CERT dataset. The main advantage insiders have over external attackers is their knowledge of the internal system, which lets them bypass known security checks and stay hidden. The paper [34] focuses on insider threat detection through conducting an analysis of

users. A series of events and activities is analyzed for feature selection in order to efficiently recognize adversarial behavior. A deep learning-based approach is suggested that detects insiders with greater accuracy and a low false positive rate. The CMU CERT r4.2 dataset is used in that research.

IV. CMU DATASET
The lack of real-world test data is one of the biggest hurdles researchers face when investigating the causes of insider threats. Businesses are rarely inclined to share attack data, in order to safeguard their privacy. For this reason, researchers rely on artificial or synthetic threat datasets such as the Insider Threat Dataset developed by the Computer Emergency Response Team (CERT Division) at Carnegie Mellon University [35].
1) Evolution of dataset: CERT developed a collection of synthetic Insider Threat Test Datasets in collaboration with ExactData and with sponsorship from DARPA. These datasets were designed as part of a project at CMU (Carnegie Mellon University). They consist of artificial test data that mimics the behavior of a real-life insider threat that an organization may face. The Insider Threat Dataset is assembled using numerous interdependent systems that mimic a virtual organization to create log behaviors. It is a collection of artificial threats that delivers carefully manufactured data of background and malicious actors.
2) Explanation of files: The Insider Threat Test Datasets contain a total of 14 files. Datasets are arranged according to the data generator release that assembled them. Each dataset has a Readme file that provides precise notes about the features included in that particular release. The answers.tar.bz2 file is the answer key that holds the details of the malicious activity included in each dataset. It contains information such as explanations of each scenario enacted and the ids of the manufactured users.
The Insider Threat Test Datasets offer various releases to choose from. For this research we selected release r4.2, since it contains multiple instances of each scenario and numerous users are involved in each scenario. CERT r4.2 is used for training and testing purposes and consists of the normal and malicious activity of 1000 users recorded over a period of 18 months from 2010 to 2011. Each event type is recorded in a separate CSV file. CERT r4.2 consists of logins, logouts, connected devices, disconnected devices, website visits, psychometric data, emails (sentiments categorized as a positive or negative activity), file open events, file close events, organizational structure, and user information records. The r4.2 dataset comprises seven distinct parts: device.csv, email.csv, file.csv, http.csv, logon.csv, psychometric.csv, and LDAP (folder).
Device.csv records the behavior of file access on the device. File.csv logs the data of files copied to and from removable media. Email.csv contains records of email communications between employees; the data fields included in this file include id, date, to, and from. Http.csv in r4.2 retains data to track employees' browsing behavior. It contains records of existing employees visiting different employment websites looking for jobs, and reports any sign of unsatisfied employees who are planning to leave the company [36]. Logon.csv maintains the logon and logoff activities of employees. This file contains 5 data fields: employee id, activity date, user, pc (system), and the activity an employee performed. Psychometric.csv in r4.2 provides Big Five personality trait scores for users. Psychometric data are normally recorded by the HR department of an organization. The LDAP directory contains files documenting the list of employees. Every file includes 4 fields: employee name, user id, email, and role. A summary of the files included in the r4.2 CERT dataset is shown in Table I.
3) Dataset Scenarios: Version r4.2 of the CERT dataset has the following three primary scenarios:
1) Scenario 1: A user who has never used removable drives or worked after work hours suddenly begins to log in after work hours, starts using a removable media drive, and starts uploading private data to WikiLeaks before leaving the organization.
2) Scenario 2: A user begins to look for jobs and starts contacting the organization's competitors for employment. That same user also begins to use a thumb drive/media drive to steal company data before leaving the company.
3) Scenario 3: The system administrator downloads a keylogger and transfers it to his supervisor's or manager's machine using a removable thumb drive. The next day, the administrator logs in as his supervisor or manager and sends an alarming email. The email causes panic throughout the organization, after which the system administrator immediately leaves the organization.

V. PROPOSED SYSTEM
The proposed methodology provides user-centred anomaly detection. The data has been transformed into a time series analysis problem. The research demonstrates the utility of a non-parametric technique for distinguishing anomalous behaviours in retrospective data. This section covers the pre-processing involved in the proposed UEBA solution as well as the feature extraction techniques and considerations. The details of the algorithm and experiment formulation are presented along with the result ensembling technique.

A. Feature Extraction
User activity is recorded in log files. These log files are industry standard and are generated by security, network, and access management applications. To simplify the problem, we transformed the logs into frequencies, i.e. numbers of occurrences. The login logs are used to extract the number of logins within and outside office hours; the email logs provide the number of emails sent within the company and to outside members; the device logs are used for the usage of removable devices; and similarly the file and http logs are processed to obtain the number of files downloaded and websites visited in a day.
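The log-to-frequency transformation described above can be sketched as follows for the logon log. This is a minimal illustration, not the authors' implementation: the timestamp format, the "Logon"/"Logoff" activity labels, the 08:00 to 18:00 office-hours window, and the function name `daily_logon_features` are all assumptions introduced here. The counters correspond to features F1 to F4 of the feature set.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical office hours; the paper does not state the boundaries it used.
OFFICE_START, OFFICE_END = 8, 18


def daily_logon_features(events):
    """Count logon events per (user, day).

    events: iterable of (timestamp, user, activity) tuples, where the
    timestamp is assumed to look like "01/04/2010 08:30:00".
    Returns {(user, date): {"F1": .., "F2": .., "F3": .., "F4": ..}}.
    """
    features = defaultdict(lambda: {"F1": 0, "F2": 0, "F3": 0, "F4": 0})
    for timestamp, user, activity in events:
        t = datetime.strptime(timestamp, "%m/%d/%Y %H:%M:%S")
        row = features[(user, t.date())]
        if activity == "Logoff":
            row["F4"] += 1      # F4: logoff frequency
        elif t.weekday() >= 5:
            row["F3"] += 1      # F3: weekend logon
        elif OFFICE_START <= t.hour < OFFICE_END:
            row["F1"] += 1      # F1: weekday logon in normal timing
        else:
            row["F2"] += 1      # F2: weekday logon outside office hours
    return dict(features)


sample = [
    ("01/04/2010 08:30:00", "U1", "Logon"),   # Monday, office hours
    ("01/04/2010 17:00:00", "U1", "Logoff"),
    ("01/04/2010 22:15:00", "U1", "Logon"),   # Monday, after hours
    ("01/09/2010 10:00:00", "U1", "Logon"),   # Saturday
]
feats = daily_logon_features(sample)
```

Stacking the per-day counts of one user in date order yields one time series per feature, which is the form of input that matrix profiling operates on.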

TABLE I
R4.2 CERT DATASET

CMU-CERT V4.2 Insider Threat Dataset

Files in the Dataset
  Logon.csv          Log of login and logoff activity
  Device.csv         Log of connecting and disconnecting a thumb drive
  File.csv           Log of copying files to any removable media device
  http.csv           Log of internet browsing history
  Email.csv          Log of email communication
  LDAP               Details of users within the organization for each month. Folder contains 18 .csv files.
  Psychometric.csv   Contains log of 5 personality traits.
Fields of the .csv Files
  id, date, user, pc, activity (logon/logoff, connect/disconnect, url, email id)
Collection Period
  18 months
Total Users
  1000 users (930 non-malicious users, 70 insiders/malicious users)
Total Instances
  322770227 instances (322762804 non-malicious instances, 7423 malicious instances)

Fig. 1. Overview of architectural diagram of the system

TABLE II
FEATURE SET USED IN THE PROPOSED APPROACH

Log Source   Feature and Description
Login        F1: Frequency of weekday logons in normal timing
             F2: Frequency of weekday logon after office hours
             F3: Frequency of weekend logon
             F4: Frequency of logoff
Email        F5: Frequency of email inside company
             F6: Frequency of email outside company
Device       F7: Frequency of connecting device in office hours
             F8: Frequency of connecting device after office hours
             F9: Frequency of connecting device on weekend
             F10: Frequency of disconnecting device
File         F11: Frequency of downloading exe file
             F12: Frequency of downloading jpg file
             F13: Frequency of downloading zip file
             F14: Frequency of downloading doc file
             F15: Frequency of downloading pdf file
             F16: Frequency of downloading other file
Http         F17: Count of urls visited

B. Matrix Profiling
The algorithm provides spatio-temporal clustering of the hidden patterns in the feature map. In the presented work, the matrix profiling algorithm is configured for incremental computation and leverages the GPU for time series data analysis.
1) STOMP Algorithm: Matrix profiling for time series data mining can be performed using several algorithms, including Scalable Time series Anytime Matrix Profile (STAMP) [37], Time Series subsequences All-Pairs-Similarity-Search (TSAPSS) [37], Scalable Time series Ordered-search Matrix Profile (STOMP) [38], SCRIMP [39], etc. The methodology adopted in this work is implemented via the STOMP algorithm. Matrix profiling is applied to the complete feature map obtained in the previous section. The STOMP algorithm operates on subsequence joins and supports A-B joins against a baseline for motif or anomaly detection. Since the proposed system does not work on predefined user behavior baselines, the distance calculation is performed using the sliding window dot product of a self join. The dot product is obtained for the time series data of each feature for all users (Algorithm 1, line 3). The resulting dot product is used to calculate the distance profile of two time series subsequences T_{i,m} and T_{j,m} (Algorithm 1, line 4).

Algorithm 1 STOMP(T, m)
1: n ← Length(T), l ← n − m + 1
2: µ, σ ← ComputeMeanStd(T, m)
3: QT ← SlidingDotProduct(T[1 : m], T), QT_first ← QT
4: D ← CalculateDistanceProfile(QT, µ, σ, 1) // see Eq 1
5: P ← D, I ← ones // initialization
6: for i = 2 to l // in-order evaluation
7:   for j = l downto 2 // update dot product, see (4)
8:     QT[j] ← QT[j − 1] − T[j − 1]T[i − 1] + T[j + m − 1]T[i + m − 1]
9:   end for
10:  QT[1] ← QT_first[i]
11:  D ← CalculateDistanceProfile(QT, µ, σ, i) // see Eq 1
12:  P, I ← ElementWiseMin(P, I, D, i)
13: end for
14: return P, I
15: End

The STOMP algorithm has a computational complexity of O(n^2).

2) Distance Profile: For simplicity, the distance profile used in the presented work was chosen to be the z-normalized Euclidean distance given in Eq. 1:

d_{i,j} = \sqrt{2m\left(1 - \frac{QT_{i,j} - m\mu_i\mu_j}{m\sigma_i\sigma_j}\right)}    (1)

Here m is the window size, i.e. the length of the subsequence. µ_i is the mean of series T_{i,m} and µ_j is the mean of series T_{j,m}. Similarly, the standard deviation of T_{i,m} is given by σ_i and the standard deviation of T_{j,m} is given by σ_j.

3) Resultant Series Matrix Profile: The result of the STOMP algorithm is the matrix profile of the dataset, given by P_{i,m}. The profile has l = n − m + 1 entries (Algorithm 1, line 1), i.e. approximately the length of the input time series. The profile obtained from Algorithm 1 has a magnitude of zero for non-anomalous behavior. Changes in user activity patterns are thus indicated by peaks of variable magnitude in the resulting matrix profile, as shown in Figure 2. The changes in user behaviour across the multiple features are further post-processed to obtain indicators of anomalous activity.

Fig. 2. Matrix Profile for Anomalous and Normal URL Activity

C. Stumpy
The presented work makes use of a Python-based implementation of the matrix profile algorithm. The Stumpy [40] library parallelizes the computation and leverages hardware accelerators. Stumpy makes it easier to analyze large time-series data that would otherwise be unmanageable. Stumpy provides a user- and data-agnostic implementation of the various matrix profile algorithms. Stumpy generates analyzable and actionable insights on which the detection algorithm works in near real time to raise indicators of behavior variations. The system is devised to ingest data from various sources for an entire day's activity and generate insights for each user. The stumpy implementation uses the numba [41] just-in-time (JIT) compiler to optimize computational speed and parallelization. The presented work uses the stump function for discord discovery. The experimentation takes as input time series data of 500 days. Stumpy aids in the retrieval of discord patterns from the matrix profile. The matrix profile is an array that holds, for each sliding-window subsequence, the minimum of its distance profile. The window size, i.e. the size of the subsequence for anomaly discovery, was tuned to be 5 days. This value was found empirically. The peaks in the resulting matrix profile indicate the discords, namely the anomalies, as shown in Figure 3, and the minima show the conserved subsequences, also called motifs.

Fig. 3. Matrix Profile Discord/Anomaly

D. Post Processing & Suppression of False Positives
The matrix profile for each feature was obtained using the stumpy implementation of matrix profiling. The system presented in this work builds an ensemble of the feature profiles. The anomalies in the individual features were found to be jittery and prone to false positives. The system samples profiles across multiple features and produces the global anomaly pattern if it finds overlapping discords across multiple features, as shown in Figure 4.

Fig. 4. Ordered anomaly search across multiple features

VI. RESULTS AND EVALUATION
The CERT CMU dataset considered for this research contains 70 known malicious insiders. This section presents the detection of these users as suspicious users using the proposed methodology.

A. Performance Metrics
The ground truth from the CERT Insiders dataset is used to calculate the performance metrics. The data is imbalanced, with 70 anomalous users among the 1000 system users. This limits the choice of relevant performance metrics. For research problems where a missed detection can be costly, such as fraud, tumor, and threat detection, the F1 score is the

most interpretable performance metric, along with a high value of recall. A good recall score prevents us from marking positive samples as negative, thus allowing good detection of insiders. Our algorithm does suffer from a high number of false positives. This is visible from the low precision score, which also impacts the F1 score. A weighted ensemble of user activity can reduce the false positives.

There is no learning involved, and the only hyperparameter is the window size, which can be set empirically.
2) How to devise a methodology for handling huge amount of data?
The proposed system uses an optimized implementation of the distance profile methodology. The time required to profile is O(n^2).

TABLE V
S CALIBILITY C OMPARISION

Brute-Force STAMP STOMP

O(n2 m) O(n2 logn ) O(n2 )

3) How to suppress false positive in large scale en-

terprise environment? As discussed in literature the
characteristics of an insider are found to reflect in more
than one features that are under the consideration of this
Fig. 5. Matrix Profile Performance Metrics work. The problem formulation in this research work
indicates tat the false positive detection of a user as an
insider can be reduced by ordered search across multiple
TABLE III feature. Since the categorization of the user as malignant
C ONFUSION M ATRIX L OGON
or benign is outside analytics engine, the anomaly detec-
Predicted Insider Predicted Benign tion algorithm can further be fine tuned by the individual
Actual Insider TP = 62 FN = 8 enterprise as per the requirement of their environment.
Actual Benign FP = 158 TN = 772
The graph 6 represents the computational reputation of
the chosen algorithm amongst its contending flavors
B. Comparative Analysis
Though different researches report accuracy on the CMU
dataset. It is not a very good performance comparative. The
low samples of insiders can result in high accuracy. Even for
a naive model that detects a sample as normal all the time. It
can report an accuracy of 93 %. In our research false positives
reduce the accuracy of matrix profile. This can be fixed with
tuning the ensemble of user activities with weightage. This
will allow different value to each activity. Allowing to suppress
noisy activities.

TABLE IV Fig. 6. Performance Comparison for scalability [40]

P ERFORMANCE COMPARISON WITH EXISTING APPROACHES USING THE
CMU CERT DATASET V 4.2.
VII. I NTERACTIVE A NALYSIS I NTERFACE
Method Accuracy
Random Forest with Randomization [42] 94.00 The UEBA research study provides an interactive web
Random Forest [42] 90.00 application. This section discusses the usage of the user
LSTM-RNN [43] 93.85 interface for anomaly detection. It allows for both automated
DBN-OCSVM [44] 87.79
Graph Convolutional Networks GCN [45] 94.50
and manual analysis. This feature allows for the ability to drill
Convolutional Neural Networks CNN [46] 96.34 down the detection results. Allowing for better explainability.
Proposed Matrix Profile 83.40 The web app contains three pages. It has a landing page with
sidebar navigation accessible throughout the web app. The
Analyze page has tab functions to assist manual or automated
C. Answers to Research Questions analysis. The ”Answer & Results” tab allows for comparative
1) How to account for anomaly detection without analysis.
ground truth data? The final page contains the results of the algorithm run on
As implied by the method discussed in chapter 5, the the entire CMU dataset. The page presents us with the correct
Matrix profiling does not need any ground truth data. and incorrect detection. Along with F1-score, precision and
Since the algorithm uses time series data. Computing recall. A table to highlight the false positives and negatives
distances over windowed subsections of the time series. allows for detailed analysis.
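The precision, recall, and F1-score reported for the logon activity follow directly from the confusion matrix in Table III. The following is a minimal sketch of that arithmetic; the function name `metrics_from_confusion` and its dictionary layout are illustrative and are not taken from the paper's code. It also evaluates the naive all-benign baseline mentioned in the comparative analysis.

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 for a binary confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts taken from Table III (logon activity).
logon = metrics_from_confusion(tp=62, fn=8, fp=158, tn=772)

# Naive baseline: label all 1000 users benign (70 insiders, 930 benign).
naive = metrics_from_confusion(tp=0, fn=70, fp=0, tn=930)

for name, value in logon.items():
    print(f"{name}: {value:.3f}")
print(f"naive accuracy: {naive['accuracy']:.3f}")
```

With these counts the sketch gives an accuracy of 0.834, a precision of about 0.282, a recall of about 0.886, and an F1-score of about 0.428. The all-benign baseline reaches an accuracy of 0.930 with zero recall, which is why accuracy alone is a weak basis for comparison on this dataset.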

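The matrix profile that the interface pages visualize can be illustrated with a small brute-force sketch. This is not the STUMPY implementation used in the paper; it is a plain-Python version of the slowest, $O(n^2 m)$ variant listed in Table V, written only to show the idea. The helper names, the toy series, and the exclusion zone of ceil(m/4) are assumptions made for illustration.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]


def matrix_profile(series, m):
    """For each length-m window, the z-normalized Euclidean distance
    to its nearest neighbour outside a small exclusion zone."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    excl = math.ceil(m / 4)  # assumed exclusion zone against trivial self-matches
    profile = []
    for i, w_i in enumerate(windows):
        best = math.inf
        for j, w_j in enumerate(windows):
            if abs(i - j) <= excl:
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_i, w_j)))
            best = min(best, dist)
        profile.append(best)
    return profile


def top_discord(profile):
    """Index of the window whose nearest neighbour is farthest away."""
    return max(range(len(profile)), key=profile.__getitem__)


# Toy "activity" series: a repeating 8-step pattern with one injected burst.
pattern = [0, 1, 2, 3, 4, 3, 2, 1]
series = pattern * 10
series[40:48] = [4, 0, 4, 0, 4, 0, 4, 0]  # hypothetical anomalous behaviour

m = 8  # window size, the only hyperparameter
profile = matrix_profile(series, m)
discord = top_discord(profile)
print("discord starts at index", discord,
      "with distance", round(profile[discord], 3))
```

On this toy series every window of normal behaviour has an exact repeat elsewhere, so its profile value is zero, and the discord falls on a window that overlaps the injected burst. No labels are needed at any point, which is the property the paper relies on; a library such as STUMPY computes the same quantity far more efficiently.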
A. Executive dashboard

The dashboard provides insights into user behaviour and makes it possible to analyze individual users. It further breaks down the analysis into the separate activities of the user. This allows the strength and veracity of an anomalous activity to be analyzed. Computing the matrix profile on each activity separately makes the results more explainable. The page also shows the actual and predicted result for a particular user.

Fig. 7. Analyze Page

B. Manual and Automated Analysis

The human analysis tab helps to explore the activity in order to classify the user as an insider or normal. The ability to explore the activities helps to understand their contributions to the result. The UEBA tab computes the matrix profile and plots the result over the original activity. This plot helps to understand the working of the matrix profile in detail, marking how a change in activity triggers a change in the matrix profile.

Fig. 8. Automated Analysis

C. Forensics Mechanism

The last page displays key dataset metrics. An interactive table of results helps to identify true and false positives. The table also displays the actual and predicted date and time of the anomalous activity. This helps to identify mistimed detections.

Fig. 9. Forensic Analysis

VIII. CONCLUSION

The research presents an algorithm to ease anomaly detection and insider threat mitigation. The use of the matrix profile allows for detection with minimal compute resources. This allows the algorithm to scale well as user data grows. It also retains the interpretability of the detections the proposed system makes. It helps to reduce false positives by calculating the matrix profile on each individual dataset. The system marks the user activity as anomalous after many activities show anomalous behaviour. This section provides the concluding remarks and highlights the future scope of the research area.

IX. FUTURE WORK

A. Psychometric scoring & LDAP

The user psychometric information and the organization's LDAP data logs were not included in this study. However, the psychometric data can be used for a more comprehensive analysis and for predicting the drift of a user beforehand.

B. Actual URL visited

The data used allows for the use of advanced topic discovery algorithms to better understand the context of visited URLs. Furthermore, URL reputation data as well as the company's blacklist of domains can be incorporated to better separate the malicious user population from normal, legitimate users.

C. Amount of data transferred device

The dataset is limited with regard to data transfer volumes. However, empirical analysis of insider behavior suggests that the volume of data involved in website and device activity is significant for separating normal user behavior from anomalous behavior.

X. LIMITATIONS

The research objectives are limited to the identification of insiders with malicious intent or compromised users. The identification of insiders with dynamic behaviour is beyond the scope of the work presented. Additionally, the work can analyze the anomaly in user behaviour only after the malicious activity has occurred. More contextual information can be used to predict the tendency of a user to perform anomalous activity before an incident.

ACKNOWLEDGMENT

The authors would like to acknowledge the National Center of Cyber Security for providing the research avenue. They would also like to express their gratitude to Birmingham City University, UK, and the CSIT Department, NEDUET, for providing facilities and assistance during this research.

REFERENCES

[1] “Insider threat statistics for 2022: Facts and figures,” Aug 2022. [Online]. Available: https://www.ekransystem.com/en/blog/insider-threat-statistics-facts-and-figures
[2] A. McCormac, K. Parsons, and M. Butavicius, “Preventing and profiling malicious insider attacks,” 2012.
[3] M. D. Waters, “Identifying and preventing insider threats,” 2016.
[4] CISA, “Combating the insider threat.” [Online]. Available: https://www.cisa.gov/uscert/security-publications/Combating-Insider-Threat
[5] M. McBride, L. Carter, and M. Warkentin, “Exploring the role of individual employee characteristics and personality on employee compliance with cybersecurity policies,” RTI International-Institute for Homeland Security Solutions, vol. 5, no. 1, p. 1, 2012.
[6] C. Colwill, “Human factors in information security: The insider threat–who can you trust these days?” Information security technical report, vol. 14, no. 4, pp. 186–196, 2009.
[7] A. Cummings, T. Lewellen, D. McIntire, A. P. Moore, and R. Trzeciak, “Insider threat study: Illicit cyber activity involving fraud in the us financial services sector,” 2012.
[8] L. F. Fischer, “Espionage: why does it happen?” Defense Security Institute, http://www.hanford.gov/files.cfm/whyhappens.pdf, pp. 10–3, 2000.
[9] CISA, “Defining insider threat.” [Online]. Available: https://www.cisa.gov/defining-insider-threats
[10] Fortinet, “What is ueba?” [Online]. Available: https://www.fortinet.com/resources/cyberglossary/what-is-ueba
[11] J. Graves, “How machine learning is catching up with the insider threat,” Cyber Security: A Peer-Reviewed Journal, vol. 1, no. 2, pp. 127–133, 2017.
[12] T. B. T. P. Avivah Litan, Gorka Sadowski, “Market guide for user and entity behavior analytics,” 2018. [Online]. Available: https://www.gartner.com/en/documents/3872885
[13] M. Garchery, “User-centered intrusion detection using heterogeneous data,” Ph.D. dissertation, Universität Passau, 2020.
[14] X. ZUO, F. YAN, B. HOU, Z. CHEN, and Y. GUO, “Insider threat detection model of power system based on lstm-attention,” vol. 84, 2022.
[15] N. Khan, R. J Houghton, and S. Sharples, “Understanding factors that influence unintentional insider threat: a framework to counteract unintentional risks,” Cognition, Technology & Work, vol. 24, no. 3, pp. 393–421, 2022.
[16] M. SatheeshKumar, K. Srinivasagan, and G. UnniKrishnan, “A lightweight and proactive rule-based incremental construction approach to detect phishing scam,” Information Technology and Management, pp. 1–28, 2022.
[17] M. N. Al-Mhiqani, R. Ahmad, Z. Z. Abidin, K. H. Abdulkareem, M. A. Mohammed, D. Gupta, and K. Shankar, “A new intelligent multilayer framework for insider threat detection,” Computers & Electrical Engineering, vol. 97, p. 107597, 2022.
[18] A. K. Singh and D. Saxena, “A cryptography and machine learning based authentication for secure data-sharing in federated cloud services environment,” Journal of Applied Security Research, vol. 17, no. 3, pp. 385–412, 2022.
[19] A. Litan, “Forecast snapshot: User and entity behavior analytics, worldwide, 2017,” 2017. [Online]. Available: https://www.gartner.com/en/documents/3621357
[20] A. Pinto, “Secure because math: A deep-dive on machine learning-based monitoring,” Black Hat Briefings, vol. 25, no. 1-11, p. 2, 2014.
[21] K. Rieck, “Computer security and machine learning: Worst enemies or best friends?” in 2011 First SysSec Workshop. IEEE, 2011, pp. 107–110.
[22] C. Gates and C. Taylor, “Challenging the anomaly detection paradigm: A provocative discussion,” in Proceedings of the 2006 workshop on New security paradigms, 2006, pp. 21–29.
[23] R. Sommer and V. Paxson, “Outside the closed world: On using machine learning for network intrusion detection,” in 2010 IEEE symposium on security and privacy. IEEE, 2010, pp. 305–316.
[24] P. J. Rousseeuw, “Least median of squares regression,” Journal of the American statistical association, vol. 79, no. 388, pp. 871–880, 1984.
[25] P. J. Rousseeuw and K. V. Driessen, “A fast algorithm for the minimum covariance determinant estimator,” Technometrics, vol. 41, no. 3, pp. 212–223, 1999.
[26] M. Touma, E. Bertino, B. Rivera, D. Verma, and S. Calo, “Framework for behavioral analytics in anomaly identification,” in Ground/Air Multisensor Interoperability, Integration, and Networking for Persistent ISR VIII, vol. 10190. SPIE, 2017, pp. 92–101.
[27] R. Yousef, “Measuring the effectiveness of user and entity behavior analytics for the prevention of insider threats.”
[28] A. Georgiadou, S. Mouzakitis, and D. Askounis, “Detecting insider threat via a cyber-security culture framework,” Journal of Computer Information Systems, vol. 62, no. 4, pp. 706–716, 2022.
[29] W. Ullah, A. Ullah, T. Hussain, K. Muhammad, A. A. Heidari, J. Del Ser, S. W. Baik, and V. H. C. De Albuquerque, “Artificial intelligence of things-assisted two-stream neural network for anomaly detection in surveillance big video data,” Future Generation Computer Systems, vol. 129, pp. 286–297, 2022.
[30] J. Pei, K. Zhong, M. A. Jan, and J. Li, “Personalized federated learning framework for network traffic anomaly detection,” Computer Networks, vol. 209, p. 108906, 2022.
[31] P. V. Astillo, D. G. Duguma, H. Park, J. Kim, B. Kim, and I. You, “Federated intelligence of anomaly detection agent in iotmd-enabled diabetes management control system,” Future Generation Computer Systems, vol. 128, pp. 395–405, 2022.
[32] M. Singh, B. Mehtre, and S. Sangeetha, “Insider threat detection based on user behaviour analysis,” in International Conference on Machine Learning, Image Processing, Network Security and Data Sciences. Springer, 2020, pp. 559–574.
[33] P. A. Legg, “Visualizing the insider threat: challenges and tools for identifying malicious user activity,” in 2015 IEEE Symposium on Visualization for Cyber Security (VizSec). IEEE, 2015, pp. 1–7.
[34] R. Nasir, M. Afzal, R. Latif, and W. Iqbal, “Behavioral based insider threat detection using deep learning,” IEEE Access, vol. 9, pp. 143 266–143 274, 2021.
[35] CMU, “Insider threat dataset.” [Online]. Available: https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247
[36] A. Nicolaou, S. Shiaeles, and N. Savage, “Mitigating insider threats using bio-inspired models,” Applied Sciences, vol. 10, no. 15, p. 5046, 2020.
[37] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. Keogh, “Matrix profile i: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets,” in 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 2016, pp. 1317–1322.
[38] Y. Zhu, Z. Zimmerman, N. Shakibay Senobari, C.-C. M. Yeh, G. Funning, A. Mueen, P. Brisk, and E. Keogh, “Exploiting a novel algorithm and gpus to break the ten quadrillion pairwise comparisons barrier for time series motifs and joins,” Knowledge and Information Systems, vol. 54, no. 1, pp. 203–236, 2018.
[39] Y. Zhu, C.-C. M. Yeh, Z. Zimmerman, K. Kamgar, and E. Keogh, “Matrix profile xi: Scrimp++: time series motif discovery at interactive speeds,” in 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018, pp. 837–846.
[40] S. M. Law, “STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining,” The Journal of Open Source Software, vol. 4, no. 39, p. 1504, 2019.
[41] “Numba.” [Online]. Available: https://pypi.org/project/numba/
[42] D. Noever, “Classifier suites for insider threat detection,” 2019. [Online]. Available: https://arxiv.org/abs/1901.10948
[43] F. Meng, F. Lou, Y. Fu, and Z. Tian, “Deep learning based attribute classification insider threat detection for data security,” 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 576–581, 2018.
[44] L. Lin, S. Zhong, C. Jia, and K. Chen, “Insider threat detection based on deep belief network feature representation,” 2017 International Conference on Green Informatics (ICGI), pp. 54–59, 2017.
[45] Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[46] R. G. Gayathri, A. Sajjanhar, and Y. Xiang, “Image-based feature representation for insider threat classification,” Applied Sciences, vol. 10, no. 14, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/14/4945