Automatically Detect and Analyze Security Incidents Using Machine Learning Algorithms.
Abstract. Machine Learning is AI’s brain, a type of algorithm that enables computers to
analyze data, learn from past experiences, and make decisions, all in a way that resembles
human behavior. In this paper, various machine learning techniques for fraud detection are
discussed and their performance on various data sets examined. The algorithms discussed in
this paper allow machines to make predictions based on patterns or rules identified from the
dataset and learn the relationships within the data provided.
1 Introduction
2 Literature Survey
Regression
Regression detects correlations between variables and describes how they are related
to each other. For example, regression can be used to predict the system calls of an
operating system and then identify anomalies by comparing each prediction with the
actual call. In machine learning, various kinds of algorithms allow machines to learn
the relationships within the data provided and make predictions based on patterns or
rules identified from the dataset. Regression is a machine learning technique in which
the model predicts the output as a continuous numerical value. Regression analysis is
an integral part of any forecasting or predictive model, so it is a common method in
machine-learning-powered predictive analytics. Alongside classification, regression is
a common use of supervised machine learning models. This approach to training
models requires labelled input and output training data. Regression models need to
learn the relationship between features and outcome variables, so accurately labelled
training data is vital. As a supervised learning technique, regression finds the
correlation between variables and enables us to predict a continuous output variable
from one or more predictor variables. It is mainly used for prediction, forecasting,
time series modeling, and determining the cause-effect relationship between variables.
Some of the most common regression techniques in machine learning can be
grouped into the following types of regression analysis:
1. Logistic Regression
2. Simple Linear Regression
3. Multiple Linear Regression
Logistic Regression
Logistic regression models predict the probability that a dependent variable
takes a given value. Logistic regression is used when the dependent variable can
have one of two values, such as true or false, or success or failure; in its basic
form the output is binary. A sigmoid curve maps the relationship between the
independent variables and the probability of the dependent variable. It is mainly
used for classification problems, where it predicts a categorical dependent variable
with the help of independent variables.
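As a minimal sketch of the idea, the following Python fragment fits a logistic regression model by plain gradient descent and uses the sigmoid curve to turn a linear score into a probability. The two features (a scaled failed-login rate and a new-device flag), the toy data, and the learning rate are assumptions made purely for illustration; they are not taken from any of the datasets discussed in this paper.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability that the binary dependent variable equals 1."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

def train(samples, labels, lr=0.1, epochs=2000):
    """Fit weights by stochastic gradient descent on the log-loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(x, weights, bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical toy data: [scaled failed-login rate, new-device flag]; label 1 = malicious.
samples = [[0.1, 0], [0.2, 0], [0.3, 0], [0.8, 1], [0.9, 1], [0.7, 1]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(samples, labels)
p_benign = predict_proba([0.15, 0], weights, bias)
p_malicious = predict_proba([0.85, 1], weights, bias)
print(f"benign-looking event: {p_benign:.3f}, malicious-looking event: {p_malicious:.3f}")
```

A decision threshold (commonly 0.5) then converts the predicted probability into one of the two classes.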
Simple Linear Regression
Simple Linear Regression is a type of regression algorithm that models
the relationship between a dependent variable and a single independent variable.
The relationship shown by a Simple Linear Regression model is linear, that is, a
sloped straight line, hence the name. The key point in Simple Linear Regression is
that the dependent variable must be a continuous value. However, the independent
variable can be measured on categorical or continuous values. Simple Linear
Regression has two main objectives: forecasting new observations and modeling
the relationship between the two variables.
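The use of regression for anomaly detection mentioned earlier (predict a value, then compare it with the observation) can be sketched with an ordinary least-squares line. The variables below (number of active processes and system calls per second), the sample values, and the tolerance are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def is_anomalous(x, observed, slope, intercept, tolerance):
    """Flag an observation whose residual exceeds the tolerance."""
    predicted = slope * x + intercept
    return abs(observed - predicted) > tolerance

# Hypothetical history: active processes -> system calls per second.
xs = [1, 2, 3, 4, 5]
ys = [110, 205, 310, 395, 505]
slope, intercept = fit_line(xs, ys)

print(is_anomalous(6, 610, slope, intercept, tolerance=50))   # close to the fitted line
print(is_anomalous(6, 1500, slope, intercept, tolerance=50))  # far from the fitted line
```

In practice the tolerance would be derived from the residuals observed on normal data rather than fixed by hand.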
Multiple Linear Regression
Multiple Linear Regression is an important regression algorithm that models
the linear relationship between a single continuous dependent variable and more
than one independent variable. It extends Simple Linear Regression by taking
more than one predictor variable to predict the response variable.
Clustering
Clustering identifies similarities between data points and groups them based on their
common features. Cluster analysis divides data into meaningful or useful groups
(clusters). If meaningful clusters are the objective, then the resulting clusters should
capture the “natural” structure of the data. In other cases, cluster analysis is only a
useful starting point for other purposes, e.g., data compression or efficiently finding
the nearest neighbors of points. Whether for understanding or utility, cluster analysis
has long been used in a wide variety of fields: psychology and other social sciences,
biology, statistics, pattern recognition, information retrieval, machine learning, and
data mining. In this section we provide a short introduction to cluster analysis and
briefly mention a recent technique that uses a concept-based approach. In this case,
the approach to clustering high-dimensional data must deal with the “curse of
dimensionality” [19].
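To make the notion of grouping by common features concrete, the sketch below implements k-means (Lloyd's algorithm), one widely used clustering method. It is offered as an illustration only and is not the concept-based technique referred to above. The two-dimensional points (for example, kilobytes sent and session duration) and the choice of initial centroids are assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical sessions: (kilobytes sent, duration in seconds).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
assignment, centroids = kmeans(points, centroids=[points[0], points[3]])
print(assignment)
print(centroids)
```

Points that end up far from every centroid, or in very small clusters, are natural candidates for closer inspection.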
Classification
This section describes commonly used security datasets for mobile and computer
networks. It then covers the basics of cyber attacks and threats.
Security Datasets
Machine learning techniques produce better results when the datasets are diverse and
contain data collected in real time. In this sub-section, we discuss the most frequently
used security datasets: the Defense Advanced Research Projects Agency (DARPA)
datasets, the URL dataset, the KDD Cup 99 dataset, the Australian Defence Force
Academy (ADFA) dataset, HTTP CSIC-2010, the Android malware dataset, the
Android validation dataset, Spambase, and NSL-KDD. The primary purpose of the
DARPA dataset is attack detection [3]. The URL dataset consists of five types of
URLs: benign, phishing, spam, malware, and defacement URLs. The Android
malware dataset is an Android apps-based dataset that was proposed to blacklist
malicious Android applications [21]. The Android validation dataset was generated to
find various relations between 72 real apps by extracting two types of features:
metadata and N-grams. It shows that there are different relationships between apps,
for example, siblings, false siblings, step-siblings and cousins [22].
The DARPA dataset is based on network traffic and audit logs. It is limited in its
ability to handle new system variations, and it does not reflect real-world network
traffic [23]. The ADFA dataset was developed to overcome these limitations of the
DARPA dataset, in particular the handling of new system variations [24]. The KDD
Cup 99 dataset was formed from a subset of the DARPA dataset, and the NSL-KDD
dataset is a later refinement of KDD Cup 99 [25].
There are several attacks and threats in cyberspace. Common threats include spam,
malicious URLs, fraud, phishing, malware, the disabling of firewalls and antivirus
software, and keystroke logging.
Several defense mechanisms have been installed on network systems to detect
unauthorized intrusion and probing, since cybercriminals can scan computer networks
for vulnerabilities. Intrusion detection based on network analysis falls into three
categories: signature-based, anomaly-based and hybrid. Signature-based techniques
detect known attacks, whereas anomaly-based detection flags any unusual behavior
within the network. Hybrid detection combines both techniques. There are four
categories of cyber-attacks, namely user to root (U2R), remote to local (R2L),
probing, and denial of service (DoS). If a user tries to obtain the access rights of a
root/admin user, the attack is called U2R. In contrast, if a remote user tries to gain
access as a local user, the attack is classified as R2L. If a legitimate user is denied
access to the system because the network resources are kept busy, the attack is called
DoS. In the case of probing, cybercriminals only scan the network to find weak areas
for future attacks [3].
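The difference between the three detection categories can be sketched in a few lines of Python. The signature strings, the request-rate baseline, and the z-score threshold below are hypothetical; a real system would use a curated signature database and a far richer model of normal behavior.

```python
from statistics import mean, pstdev

# Hypothetical signatures of known attacks (substrings of a request payload).
KNOWN_BAD_SIGNATURES = {"' OR 1=1 --", "../../etc/passwd"}

def signature_match(payload: str) -> bool:
    """Signature-based detection: look for patterns of known attacks."""
    return any(sig in payload for sig in KNOWN_BAD_SIGNATURES)

def anomaly_score(value: float, baseline) -> float:
    """Anomaly-based detection: distance from normal behavior in standard deviations."""
    mu, sigma = mean(baseline), pstdev(baseline)
    return abs(value - mu) / sigma if sigma else 0.0  # constant baseline: no score

def hybrid_detect(payload, requests_per_min, baseline, threshold=3.0) -> str:
    """Hybrid detection: try signatures first, then fall back to the anomaly score."""
    if signature_match(payload):
        return "signature"
    if anomaly_score(requests_per_min, baseline) > threshold:
        return "anomaly"
    return "benign"

baseline = [20, 22, 19, 21, 18, 20]  # hypothetical requests per minute under normal load
print(hybrid_detect("GET /index.html", 21, baseline))
print(hybrid_detect("GET /download?file=../../etc/passwd", 21, baseline))
print(hybrid_detect("GET /index.html", 95, baseline))
```

The second request is caught by its signature even though its rate is normal, while the third carries no known signature and is caught only because its rate deviates from the baseline.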
Cybersecurity aims to make sure that data remains private and is not hacked or leaked
to the world. Machine Learning has many applications in Cyber Security,
including identifying cyber threats, improving available antivirus software,
fighting cyber-crime that also uses AI capabilities, and so on.
An antivirus protects your system by scanning any new files on the network to
check whether they match a known virus or malware signature. However, a
traditional antivirus requires constant updates to keep up with the new viruses and
malware being created. That’s where machine learning can be extremely helpful.
Antivirus software that is integrated with machine learning tries to identify a virus
or malware by its abnormal behaviour rather than its signature. In this way, it can
handle common, previously encountered threats as well as new threats from
recently created viruses or malware. For example, Cylance, a software company,
has created a smart antivirus that learns how to detect viruses or malware from
scratch and thus does not depend on identifying their signatures to detect them.
4.3 User Behavior Modeling
Some cyber threats target a particular company by stealing the login credentials
of one of its users and then illegally logging into the network. This is very
difficult for a normal antivirus to detect, as the user credentials are authentic and
the cyberattack may even happen without anyone knowing. Here, machine learning
algorithms can help through user behaviour modelling. A machine learning
algorithm can be trained to identify the behaviour of each user, such as their login
and logout patterns. Then, any time a user deviates from their normal behavioural
pattern, the algorithm can identify it and alert the cybersecurity team that
something is out of the ordinary. Of course, some changes in user behaviour
patterns are entirely natural, but this will still help in catching more cyber threats
than conventional methods. For example, cyber security software provided by
Darktrace uses machine learning to identify the normal behavioural patterns of all
the users in a system by analysing the network traffic information.
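A very small version of user behaviour modelling is a per-user baseline of login hours. The sketch below is an assumption-laden illustration, not a description of any commercial product: the class name, the z-score threshold, the minimum history length, and the sample logins are all hypothetical, and treating the hour as a plain number ignores the wrap-around at midnight.

```python
from collections import defaultdict
from statistics import mean, pstdev

class LoginBaseline:
    """Learns each user's usual login hour and flags strong deviations."""

    def __init__(self, threshold=3.0, min_history=5):
        self.history = defaultdict(list)
        self.threshold = threshold
        self.min_history = min_history

    def observe(self, user: str, hour: int) -> None:
        """Record a login that is believed to be legitimate."""
        self.history[user].append(hour)

    def is_unusual(self, user: str, hour: int) -> bool:
        hours = self.history.get(user, [])
        if len(hours) < self.min_history:
            return False  # not enough data to judge this user yet
        mu = mean(hours)
        sigma = max(pstdev(hours), 0.5)  # floor avoids over-sensitivity to tiny spreads
        return abs(hour - mu) / sigma > self.threshold

model = LoginBaseline()
for hour in [9, 9, 10, 8, 9, 10]:  # hypothetical office-hours logins
    model.observe("alice", hour)

print(model.is_unusual("alice", 10))  # within the usual pattern
print(model.is_unusual("alice", 3))   # 3 a.m. login, far from the baseline
```

Because some changes in behaviour are natural, an alert from such a model is best treated as a prompt for review rather than proof of compromise.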
Many hackers are now taking advantage of technology and using machine learning
to find holes in security and hack systems. Therefore, it is very important that
companies use machine learning for cybersecurity as well. This might even become
the standard approach for defending against cyber attacks as attackers become
more and more tech-savvy. Consider the devastating NotPetya attack, which
utilized EternalBlue, an exploit for a vulnerability in Microsoft’s Windows OS.
These types of attacks can become even more devastating in the future with the
help of artificial intelligence and machine learning unless cybersecurity software
also uses the same technology. An example of this is CrowdStrike, a cybersecurity
technology company whose Falcon platform is security software imbued with
artificial intelligence to handle various cyber attacks.
The basic idea of machine learning is to train models to learn automatically from
large amounts of data; from that learning, a system can then spot anomalies and
trends, execute actions and ultimately make recommendations. Machine learning can
help address the increasing number of challenges in cybersecurity: detecting unknown
and advanced attacks, including polymorphic malware, and scaling up security
solutions.
Day to day, organizations deal with huge volumes of data packets passing through
firewalls. Even if only 0.1% of that data is misclassified by machine learning, huge
amounts of normal traffic can be blocked, which would severely impact the business.
It’s understandable that in the early days of machine learning, some organizations
were concerned that the models wouldn’t be as accurate as human security
researchers. It takes time, and it also takes huge amounts of data, to train a machine
learning model up to the same level of accuracy as a really skilled human. Humans,
however, don’t scale and are among the scarcest resources in IT today. We therefore
rely on ML to scale up cybersecurity solutions efficiently. ML can also help us detect
unknown attacks that are hard for humans to detect, as it can build up baseline
behaviors and detect any abnormalities that deviate from them.
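The effect of a 0.1% misclassification rate can be checked with simple arithmetic. In the sketch below, the daily packet volume, the fraction of malicious traffic, and the detection rate are hypothetical figures chosen only to show the order of magnitude; only the 0.1% rate comes from the text above.

```python
def blocked_benign(total_packets, malicious_fraction, false_positive_rate):
    """Number of normal packets wrongly blocked."""
    benign = total_packets * (1 - malicious_fraction)
    return benign * false_positive_rate

def alert_precision(total_packets, malicious_fraction, false_positive_rate, detection_rate):
    """Share of blocked packets that were actually malicious."""
    malicious = total_packets * malicious_fraction
    true_positives = malicious * detection_rate
    false_positives = blocked_benign(total_packets, malicious_fraction, false_positive_rate)
    return true_positives / (true_positives + false_positives)

# Hypothetical day of traffic: one billion packets, 1 in 10,000 malicious.
total = 1_000_000_000
wrongly_blocked = blocked_benign(total, 0.0001, 0.001)
precision = alert_precision(total, 0.0001, 0.001, detection_rate=0.99)
print(f"normal packets blocked: {wrongly_blocked:,.0f}")
print(f"share of blocks that are real attacks: {precision:.1%}")
```

Under these assumptions roughly a million normal packets are blocked per day, and most blocked packets are benign, which is why a seemingly small error rate matters at firewall scale.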
5.2 The access to large volumes of training data, especially labeled data
Machine learning requires a large amount of data to make models and predictions
more accurate. Obtaining malware samples is a lot harder than acquiring data in image
processing and NLP. There is not enough attack data, and much security risk data is
sensitive and unavailable because of privacy concerns.
5.3 The ground truth
With new devices getting connected to enterprise networks all the time, it’s not easy
for an IT organization to be aware of them all. Machine learning can be used to
identify and profile devices on a network. That profile can determine the different
features and behaviors of a given device.
With data and applications in many different locations, identifying trends across
large volumes of devices is just not humanly possible. Machine learning can do
what humans cannot, enabling automation for insights at scale.
In the digital world, threats are everywhere. A threat is the risk of an undesired or
dangerous activity that causes damage. In this paper we address the steps of the
Cyber Threat Intelligence Cycle. Fig. 1 shows the phases of the cycle.
Direction – In this first phase, the value of the entity and the potential impacts of
asset loss or service interruption are assessed. Many questions arise, including:
What needs protecting, and why? What types of TI are required? Who will be
receiving the TI, and how? Answers to these questions are the cornerstone of the
whole intelligence program and inform the development of guidelines for data
collection methods and resource assignments.
Processing – The resultant lake of raw data from Collection is not usable on its own,
because there is simply too much of it and it is not in a common form. During the
Processing stage, the data is formatted so that it can be understood by, and is suited
to, the user.
Analysis and Production – Now that the data is usable, the goals of the organization
should be reconsidered to enhance the threat information. Various analysis techniques
are used to determine whether suspicious behaviors are correlated and relevant.
Context and priority are added, turning the data into finished intelligence.
Dissemination – The threat intelligence is now ready to be shared with the user, either
through a report, feed, or automated platform. The security team will use the TI to
build and act on priority plans for mitigation and proactive protection, focusing on
alerts of the highest importance or impact to their organization.
Feedback – After any alert or threat event, it is critical to re-analyze the security goals
of the organization. Are there too many or too few alerts? Is there a different type of
data we need? Is the TI actionable? Further, by redirecting assets or pivoting in a new
direction, organizational efficiency is continually refined.
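The Processing phase described above, in which raw data arriving in different forms is brought into a common form, can be sketched as follows. The two feed layouts (a CSV feed and a JSON feed), their field names, and the indicator values are hypothetical; the addresses come from ranges reserved for documentation.

```python
import csv
import io
import json

def from_csv_feed(text):
    """Hypothetical feed A: CSV rows of type,value,first_seen."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        kind, value, first_seen = (field.strip() for field in row)
        yield {"type": kind.lower(), "value": value.lower(), "first_seen": first_seen}

def from_json_feed(text):
    """Hypothetical feed B: JSON list with kind/indicator/seen keys."""
    for item in json.loads(text):
        yield {
            "type": item["kind"].lower(),
            "value": item["indicator"].strip().lower(),
            "first_seen": item["seen"],
        }

def process(*feeds):
    """Merge normalized records and drop duplicate indicators."""
    seen, unified = set(), []
    for feed in feeds:
        for record in feed:
            key = (record["type"], record["value"])
            if key not in seen:
                seen.add(key)
                unified.append(record)
    return unified

csv_text = "IP,203.0.113.7,2020-05-01\ndomain,Bad.example,2020-05-02\n"
json_text = (
    '[{"kind": "domain", "indicator": "bad.example", "seen": "2020-05-03"},'
    ' {"kind": "ip", "indicator": "198.51.100.9", "seen": "2020-05-04"}]'
)
indicators = process(from_csv_feed(csv_text), from_json_feed(json_text))
for record in indicators:
    print(record)
```

Once every record shares one schema, the Analysis and Production phase can correlate indicators without caring which feed they came from.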
Ransomware attacks involve hacking into a user’s data and preventing them from
accessing it until a ransom amount is paid. They are critical for individual users but
more so for businesses that can’t access the data needed to run their daily operations.
Ransomware attacks have become popular in the last few years and pose one of
India’s most prominent Cyber Security challenges in 2020. According to the Cyber
Security firm Sophos, about 82% of Indian organizations were hit by ransomware in
the last six months.
According to IoT Analytics, there will be about 11.6 billion IoT devices by 2021. IoT
devices are computing, digital, and mechanical devices that can autonomously
transmit data over a network. Examples of IoT devices include desktops, laptops,
mobile phones, smart security devices, etc. Safeguarding IoT devices is one of the
biggest challenges in Cyber Security, as gaining access to these devices can open the
doors for other malicious attacks.
8.3 Cloud Attacks
Most of us today use cloud services for personal and professional needs. Hacking
cloud platforms to steal user data is therefore one of the challenges in Cyber Security
for businesses. If an attack like the iCloud hack were carried out on enterprise data, it
could pose a massive threat to the organization and maybe even lead to its collapse.
Phishing is a type of social engineering attack often used to steal user data, including
login credentials and credit card numbers. Unlike in ransomware attacks, the hacker,
upon gaining access to confidential user data, doesn’t block it. Instead, they use it for
their own advantage, such as online shopping and illegal money transfers. Phishing
attacks are prevalent among hackers because they can exploit the user’s data until the
user finds out about it. Phishing attacks remain one of the major challenges of Cyber
Security.
While blockchain and cryptocurrency might not mean much to the average internet
user, these technologies are a huge deal for businesses. Thus, attacks on these
frameworks pose considerable challenges in Cyber Security for businesses, as they
can compromise customer data and business operations. These technologies have
passed their infancy but have not yet reached an advanced, secure stage. Thus,
several attacks have been observed, such as DDoS, Sybil, and Eclipse, to name a
few. Organizations need to be aware of the security challenges that accompany these
technologies and ensure that no gap is left open for intruders to invade and exploit.
Even the most advanced software has some vulnerabilities that might pose significant
challenges to Cyber Security in 2020, given that the adoption of digital devices is now
greater than ever before. Individuals and enterprises often don’t update the software
on these devices because they find it unnecessary. However, updating your device’s
software to the latest version should be a top priority. An older software version
might contain security vulnerabilities that the developers have patched in the newer
version. Attacks on unpatched software versions are one of the major challenges of
Cyber Security. These attacks are usually carried out on a large number of
individuals, like the Windows zero-day attacks.
While Machine Learning and Artificial Intelligence technologies have proven highly
beneficial for massive development in various sectors, they have their vulnerabilities
as well. These technologies can be exploited by unlawful individuals to carry out
cyberattacks and pose threats to businesses; for example, they can be used to identify
high-value targets in a large dataset. Machine Learning and AI attacks are another big
concern in India.
While most challenges of Cyber Security are external for businesses, there can be
instances of an inside job. Employees with malicious intent can leak or export
confidential data to competitors or other individuals. This can lead to huge financial
and reputational losses for the business. These challenges can be mitigated by
monitoring the data and the inbound and outbound network traffic. Installing
firewall devices for routing data through a centralized server or limiting access to
files based on job roles can help minimize the risk of insider attacks.
9 Conclusion
In this paper we have discussed various cyber threats and how to analyze and detect
them across different data sets and organizations. Several ML algorithms, such as
regression, clustering and classification, have been discussed for working on data sets
and preventing attacks. We have also discussed different cyber security threats.
References
1. Daniel S. Berman, Anna L. Buczak, Jeffrey S. Chavis and Cherita L. Corbett.: A Survey of
Deep Learning Methods for Cyber Security. Information 2019.
2. Lili Zhang, Himanshu Vashisht, Andrey Totev, Nam Trinh and Tomas Ward.: A
comparison of distributed machine learning methods for the support of “many labs”
collaborations in computational modeling of decision making. Frontiers 2022.
3. Kamran Shaukat, Suhuai Luo, Vijay Varadharajan, Ibrahim A. Hameed, Shan Chen, Dongxi
Liu and Jiaming Li.: Energies, 2020.
4. Buczak, A.L.; Guven, E.: A Survey of Data Mining and Machine Learning Methods for Cyber
Security. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176.
5. Xin, Y.; Kong, L.; Liu, Z.; Chen, Y.; Li, Y.; Zhu, H.; Gao, M.; Hou, H.; Wang, C.: Machine
Learning and Deep Learning Methods for Cybersecurity. IEEE Access 2018, 6, 35365–
35381.
6. Chathurika S Wickramasinghe, Daniel Marino, Kasun Amarasinghe, Milos Manic.:
Generalization of Deep Learning for Cyber-Physical System Security: A Survey.
ResearchGate 2018.
7. Shah, N.F.; Kumar, P. A comparative analysis of various spam classifications. In Progress in
Intelligent Computing Techniques: Theory, Practice, and Applications; Springer:
Berlin/Heidelberg, Germany, 2018; pp. 265–271.
8. Chandrasekar, C.; Priyatharsini, P. Classification techniques using spam filtering email. Int.
J. Adv. Res. Comput. Sci. 2018, 9, 402.
9. Rishabh Das, Thomas Morris.: Machine Learning and Cyber Security. ResearchGate 2017.
10. Said A. Salloum, Muhammad Turki Alshurideh, Ashraf M Elnagar, Khaled Shaalan.:
Machine Learning and Deep Learning Techniques for Cybersecurity: A Review.
ResearchGate 2020.
11. Alex Mathew.: Machine Learning in Cyber-Security Threats. ICICNIS 2020.
12. Jordan, M.I., Mitchell, T.M.: Machine learning: Trends, perspectives, and prospects.
Science 349(6245), 255–260 (2015).
13. Manjeet Rege, Raymond Blanch K. Mbah.: Machine Learning for Cyber Defense and
Attacks. The seventh international conference on data analytics, 2018, ISBN: 978-1-61208-
681-1.
14. Katanosh Morovat, Brajendra Panda.: A Survey of Artificial Intelligence in Cybersecurity.
International Conference on Computational Science and Computational Intelligence, 2020.
15. Shiravi, A.; Shiravi, H.; Tavallaee, M.; Ghorbani, A.A. Toward developing a systematic
approach to generate benchmark datasets for intrusion detection. Comput. Secur. 2012, 31,
357–374.
16. L. Huang, A. D. Joseph, B. Nelson, B. Rubinstein, and J. D. Tygar.: Adversarial machine
learning. In 4th ACM Workshop on Artificial Intelligence and Security, AISec, 2011, pages
43–57, Chicago, IL, USA, October 2011.
17. Gonzalez, H.; Stakhanova, N.; Ghorbani, A.A. Droidkin: Lightweight detection of android
apps similarity. In Proceedings of the International Conference on Security and Privacy in
Communication Networks, Orlando, FL, USA, 23–25 October 2009; pp. 436–453.
18. Panigrahi, R.; Borah, S. A detailed analysis of CICIDS2017 dataset for designing Intrusion
Detection Systems. Int. J. Eng. Technol. 2018, 7, 479–482.
19. Xie, M.; Hu, J.; Slay, J. Evaluating host-based anomaly detection systems: Application of
the one-class SVM algorithm to ADFA-LD. In Proceedings of the 2014 11th International
Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Xiamen, China, 19–21
August 2014; pp. 978–982.