
Computers & Security 121 (2022) 102861

Contents lists available at ScienceDirect

Computers & Security


journal homepage: www.elsevier.com/locate/cose

Comparative research on network intrusion detection methods based on machine learning
Chunying Zhang a, Donghao Jia a, Liya Wang a,∗, Wenjie Wang a, Fengchun Liu b, Aimin Yang a
a College of Science, North China University of Science and Technology, China
b Qianan College, North China University of Science and Technology, China

Article info

Article history:
Received 23 February 2022
Revised 2 July 2022
Accepted 24 July 2022
Available online 28 July 2022

Keywords:
Network intrusion detection
Machine learning
Deep learning
Comparative experiment

Abstract

Network intrusion detection system is an essential part of network security research. It detects intrusion behaviors through active defense technology and takes emergency measures such as alerting and terminating intrusions. With the rapid development of machine learning technology, more and more researchers apply machine learning algorithms to network intrusion detection to improve detection efficiency and accuracy. Due to the different principles of various algorithms, they also have their advantages and disadvantages. To construct the dominant algorithm model in the field of network intrusion detection and provide the accuracy value, this paper systematically combs the application literature of machine learning algorithms in intrusion detection in the past ten years. A review is made from three categories: traditional machine learning, ensemble learning, and deep learning. Then, this paper selects the KDD CUP99 and NSL-KDD datasets to conduct comparative experiments on decision trees, Naive Bayes, support vector machines, random forests, XGBoost, convolutional neural networks, and recurrent neural networks. The detection accuracy, F1, AUC, and other indicators of these algorithms on different data sets are compared. The experimental results show that the effect of the ensemble learning algorithm is generally better. The Naive Bayes algorithm has low accuracy in recognizing the learned data, but it has obvious advantages when facing new types of attacks, and the training speed is faster. The deep learning algorithm is not particularly prominent in this experiment, but its optimal results are affected by the structure, hyperparameters, and the number of training iterations, which need further in-depth study. Finally, the main challenges facing the current network intrusion detection field are summarized, and the future research directions have been prospected.
© 2022 Elsevier Ltd. All rights reserved.

∗ Corresponding author.
E-mail address: [email protected] (L. Wang).
https://doi.org/10.1016/j.cose.2022.102861
0167-4048/© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

The wide application of information technology and the rise and development of cyberspace have promoted the prosperity and progress of the economy and society, but they have also brought new security risks and challenges. Cyberspace security concerns the common interests of humanity, world peace and development, and national security of all countries (Liu, 2017). The advent of the digital age means that everything will be interconnected, and the entire world will be built on software. Where there is Internet software, there will be loopholes, network infrastructure will become more complex, and the attack surface will expand infinitely. Network security attacks will show advanced and large-scale characteristics. All financial and technological wars are, in the final analysis, network wars. China Internet network security monitoring data analysis report shows that in the first half of 2020, there were about 19,000 counterfeit pages on domestic websites, about 18,000 overseas IP addresses implanted backdoors on about 35,900 websites in China, and about 74,000 websites were tampered with within China, including 318 government websites. The number of DDoS attacks on the cloud platform accounted for 76.1% of the domestic targets attacked. The number of implanted backdoor links accounted for 90.3% of all domestically planted backdoor links. The number of industrial equipment exposed on the Internet reached 4630, and it continued to be scanned and sniffed more than 20,000 times a day from abroad (National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC) 2020). It can be seen that cyberspace security is facing a considerable threat, and it is necessary to increase efforts to improve technology for efficient defense. There are two main types of network security technologies. One is passive prevention technology such as data encryption, identity authentication, firewall, etc. The other is active prevention technology, mainly network intrusion detection,
which can detect some abnormal network behaviors and deal with them accordingly. Active defense technology based on network intrusion detection has become an important research topic in network security.

The network intrusion detection system is divided into two modes according to the behavior of intrusion detection: anomaly detection and signature detection. Anomaly detection needs to establish a model of normal access behaviors and judges behaviors that do not conform to this model as intrusions. The core of this detection method lies in how to define the so-called "normal" situation. In contrast, signature detection needs to summarize all possible unfavorable and unacceptable behaviors to establish a model. The behavior that conforms to this model is judged as an intrusion. This method mainly evaluates whether the event characteristics that violate the security policy are in the collected data. The core technology is to maintain a knowledge base (Wu, 2019).

Machine learning is a science of artificial intelligence, which mainly studies how to use data or previous experience to achieve algorithm classification and prediction (Xin et al., 2018). Its theories and methods have been widely used to solve complex problems in engineering applications and scientific fields, such as natural language understanding, non-monotonic reasoning, machine vision, pattern recognition, text understanding, sentiment analysis, image retrieval and understanding, and analysis of graph and network data. With the development of machine learning technology and the increasing complexity of network intrusion detection problems, many researchers have applied machine learning algorithms to network intrusion detection systems and achieved good detection results (Gumusbas and Cybersecurity, 2022). The method based on a machine learning algorithm is mainly used in abnormal intrusion detection systems, and its basic process is shown in Fig. 1. The most important part of the process is to define the analysis goals, because the data and models required to study different intrusion detection problems will differ.

Fig. 1. The application process of machine learning in abnormal intrusion detection.

Many researchers have improved the machine learning algorithm and introduced the improved method into the network intrusion detection system, and proposed many practical defense technology algorithms, models, and systems (Handa et al., 2019). This paper selects the algorithms of machine learning in network intrusion detection systems in the past ten years for overview and analysis and compares these algorithms through experiments. The main contributions of this paper are divided into the following points.

1 This review introduces the application of common machine learning algorithms in intrusion detection in the past ten years. It provides solutions and ideas for subsequent research on problems such as data imbalance and incremental learning in intrusion detection.
2 This review conducts comparative experiments on several types of machine learning algorithms in intrusion detection and provides data support for constructing the dominant algorithm model of intrusion detection.
3 This review discusses the challenges facing the intrusion detection field and directions for future development.

2. Intrusion detection system and application

Intrusion Detection System (IDS) refers to a series of devices or software that play an essential role in combating intrusions and malicious behavior in modern organizations. Its function is not to eliminate but to guard against network attacks. It determines whether anomalies occur by detecting traffic or logs. If there is any abnormality, it will send an alarm to the system's management unit. The intrusion detection system is a proactive security defense technology for identifying intrusion behavior. If an intrusion behavior has occurred, it will actively respond and take measures to prevent it and collect intrusion evidence (Kolandaisamy et al., 2020). It is essential to use and improve the intrusion detection system. There is no safe system in the world. Target systems are often attacked by two types of users (Williamson, 2003). One is a legitimate user who is part of the system but exceeds the trust range specified by the system. The other is an illegal user who is not identified by the system but operates or attacks the target system.

2.1. Common intrusion detection framework

In April 1980, Anderson (Anderson, 1980) first proposed the concept of intrusion detection. After that, Lunt et al. (1988) proposed an expert system for intrusion detection. In 1986,
Denning (1987) proposed an intrusion detection model consisting of six parts: objects, subjects, abnormal records, activity rules, audit records, and behavior profiles, which received extensive attention. Kahn et al. (1998) proposed a Common Intrusion Detection Framework (CIDF) after integrating various intrusion detection models. It has a simple structure and is divided into four parts, which are event generator, event analyzer, response unit, and event database, as shown in Fig. 2.

Fig. 2. A common intrusion detection framework.

2.2. Intrusion detection system classification based on detection method

The intrusion detection system can be divided into host-based and network-based detection according to different data sources. According to the different detection methods, it can be divided into signature detection and anomaly detection (Otoum and Nayak, 2021). A variety of different algorithms can be used under different detection methods, as shown in Fig. 3.

Fig. 3. Intrusion detection system classification.

2.2.1. Signature detection

Signature detection is an intrusion detection method based on pattern matching. It identifies and characterizes the intrusion behavior and then constructs a database. When the data in the network activity information matches the characteristics contained in the database, the intrusion detection system will issue a warning. The basic model is shown in Fig. 4. There are many signature-based intrusion detection methods, including pattern matching method (Baig and Salah, 2016), expert system method (Benferhat et al., 2013), state transition analysis method (Ilgun and Kemmerer, 1995), and so on. The current technology of rule-based intrusion detection is relatively mature, and many mature commercial products have appeared. Intrusion detection systems such as Modsecurity have been widely used.

The main advantages of signature intrusion detection technology are: (1) only need to collect relevant data, the system burden is small; (2) similar to the virus detection system, the accuracy and efficiency are relatively high; (3) the technology is relatively mature.
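The matching step described for signature detection can be illustrated with a minimal sketch. It is not taken from the paper or from any real product: the signature names, the regular expressions, and the sample events are hypothetical and chosen only to show the idea of comparing observed activity against a database of known patterns.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Signature:
    """One entry of the signature database: a label and a compiled pattern."""
    name: str
    pattern: "re.Pattern[str]"


# Hypothetical signature database (illustrative patterns only).
SIGNATURES = [
    Signature("sql_injection", re.compile(r"(?i)\bunion\b.+\bselect\b")),
    Signature("path_traversal", re.compile(r"\.\./")),
    Signature("xss_script_tag", re.compile(r"(?i)<script\b")),
]


def match_signatures(payload: str) -> List[str]:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [sig.name for sig in SIGNATURES if sig.pattern.search(payload)]


if __name__ == "__main__":
    # Hypothetical observed events, e.g. request lines taken from a log.
    events = [
        "GET /index.html",
        "GET /item?id=1 UNION SELECT password FROM users",
        "GET /static/../../etc/passwd",
    ]
    for event in events:
        hits = match_signatures(event)
        if hits:
            # The warning issued when an event matches the database.
            print("ALERT", hits, event)
```

An event that matches no stored pattern produces no alert, so the sketch can only recognize behaviors that are already described in its database.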
However, this method also has significant limitations: (1) Unknown intrusion behavior cannot be detected, and the false-negative rate is relatively high; (2) The dependence on the specific system is strong: the implementation mechanisms of different operating systems are different, and the attack methods are also different, so it is difficult to define a unified pattern library (Yin, 2019).

Fig. 4. Signature-based intrusion detection model.

2.2.2. Anomaly detection

Abnormal intrusion detection requires a large amount of historical activity data and time-series information to establish a normal behavior profile and to judge the abnormality of the activity to be tested, as shown in Fig. 5. In abnormal intrusion detection, abnormal activities include intrusion behaviors. Ideally, abnormal activities are intrusion behaviors. However, unusual activity and intrusion are not always the same. There are four cases: (1) Intrusion behavior is not abnormal activity; (2) Abnormal activity is not intrusion behavior; (3) Neither intrusion behavior nor abnormal activity; (4) Both intrusion behavior and abnormal activity. Anomaly intrusion detection is to identify intrusion behaviors in the constructed anomalous activity set, mainly through the establishment of anomaly models. Anomaly detection is a detection technology that infers changes in user activities through known deviations of measured values and then makes judgments (Jiang et al., 2000).

Fig. 5. Anomaly-based intrusion detection model.

Commonly used anomaly detection algorithms mainly include intrusion detection methods based on statistics (Luca et al., 2017), intrusion detection methods based on machine learning (Nassif et al., 2021), and intrusion detection methods based on data mining (Sun et al., 2019). Compared with signature detection, the advantage of anomaly detection is that it is easier to discover new attack types. However, due to the immature technology, anomaly detection can easily mistake rare normal behavior for abnormal behavior.

Although anomaly-based intrusion detection technology is not mature enough, some projects have been developed. For example, the SuStorID system, developed by a team at the University of Cagliari in Italy, is an anomaly intrusion detection system based on machine learning. The SuStorID system is open source and has not been used in actual production and practice. Learning-related knowledge can be accessed by visiting the website: http://comsec.diee.unica.it/sustorid/.

2.3. Application of traditional machine learning algorithm in intrusion detection system

This paper mainly introduces the application of decision trees, SVM, and Naive Bayes and their improved algorithms in intrusion detection, summarized in Table 1.

Table 1
Summary of intrusion detection based on traditional machine learning.

Paper | Proposed Method | Goal/Success | Model
Hota and Shrivas (2014) | A comprehensive and comparative analysis of the decision trees. | Using the NSL-KDD dataset, the C4.5 algorithm works best, and when using the Info Gain feature selection technology, its accuracy reaches 99.68%. | Decision Tree
Bagyalakshmi and Samundeeswari (2020) | DDoS attack classification on cloud environment using machine learning techniques with different feature selection methods. | The decision tree technology based on learning vector quantization has the best classification effect, with an accuracy of 98.74%. | Decision Tree
Umak and Mishra (2014) | An efficient modular approach of intrusion detection system based on MSPSO-DT. | The MSPSO-DT system improves intrusion detection accuracy to more than 99.43%; the sped-up testing time is 11.13 sec. | Decision Tree
Mahbooba et al. (2021) | The importance of features based on entropy measure for intrusion detection is studied based on decision trees. | The concept of explainable artificial intelligence is proposed, and the intrusion classification rules extracted from the decision tree method are explained, which provides a reference for subsequent research on the interpretation of more complex deep learning models. | Decision Tree
Meddeb et al. (2019) | A set of IDS nodes is proposed to supervise and analyze the behavior of the observed system, and collect and construct training data. | On the constructed new dataset, the SVM-based intrusion detection model achieves high-precision detection for ad hoc networks. | SVM
Al-Qatf et al. (2018) | Using the SAE algorithm to reconstruct the dataset. | Accuracy reaches 95.09% and 84.96% on KDD99 and NSL-KDD datasets. | SVM
Shen et al. (2020) | Feature extraction from data using fuzzy rough set theory. | Realize high-precision and fast identification, reduce system burden, and achieve 99.84% accuracy on NSL-KDD data set. | SVM
Wang and Jin (2020) | SVM is used for initial classification, and then a density-based clustering algorithm is used for sub-classification. | For small scale data, the two-level intrusion detection algorithm combined with fuzzy support vector machine and DBSCAN achieves high classification accuracy while maintaining fast training speed. | SVM
Feng et al. (2014) | A new algorithm CSVAC is proposed, which combines the modified SVM and CSOACN. | Experiments are performed using the KDD99 dataset, and the model implements incremental learning. | SVM
Kabir et al. (2018) | An optimal allocation-based least squares support vector machine (OA-LS-SVM) intrusion detection model. | The model achieves incremental learning without reducing the generalization ability of LS-SVM, while reducing training time. | SVM
Pozi et al. (2015) | An intrusion detection model integrating genetic programming and support vector machine algorithm. | The model has good generalization and can be used to discover new types of network attacks. | SVM
Yao et al. (2015) | Intrusion detection model based on decision tree and Naïve Bayes classification. | Alleviates the constraints of Naive Bayes conditional independence and avoids the problem of overfitting of decision trees. | Naïve Bayes
Wang et al. (2014) | Improved Naive Bayes algorithm. | The TrS and TeS of the KDD99 intrusion detection data set are used for simulation experiments, and the accuracy is 98.21%. | Naïve Bayes
Zhang et al. (2018) | The KDD99 dataset was preprocessed using IPCA and detected using the Gaussian Naive Bayes algorithm. | Reduce model training and testing time. | Naïve Bayes
Gu and Lu (2021) | A Naive Bayes algorithm using feature embedding ideas. | Convert low-quality data into high-quality data with obvious feature categories. | Naïve Bayes

2.3.1. Decision tree learning model

The decision tree (Cañete-Sifuentes et al., 2021) is a long-established machine learning algorithm. Because of its excellent performance, it is still popular today. Its structure is simple and interpretable. Common decision tree algorithms are ID3, C4.5, CART, etc.

(1) Comparative analysis of different decision tree algorithms in intrusion detection.

In intrusion detection research, the decision tree has achieved relatively good results. Hota and Shrivas (2014) studied the application of various decision trees in intrusion detection. This article uses the NSL-KDD data set to conduct a comparative study on the five decision tree algorithms of ID3, C4.5, CART, REP Tree, and Decision Table. The results show that the C4.5 algorithm has the best effect, the classification accuracy reaches 99.56%, and when the Info Gain feature selection technique is used, its accuracy reaches 99.68%.

From this literature, it can be known that the decision tree classification ability depends on the selection of algorithms and features. However, it is difficult to test the generalization ability of the algorithm because the literature only selects the NSL-KDD training set and uses the hold-out method for testing instead of the NSL-KDD test set. The NSL-KDD test set contains malicious behaviors that the NSL-KDD training set does not. Experiments are still needed to gain a complete understanding of decision trees.

(2) Application analysis of fusion optimization decision tree algorithm in intrusion detection.

Due to the simple structure and wide variety of decision trees, when applied to various fields, the structure is rarely improved, and the optimized data is often classified.

Bagyalakshmi and Samundeeswari (2020) used various machine learning algorithms to detect intrusion detection datasets processed by learning vector quantization or PCA and performed a comparative analysis. The results show that the decision tree classification works best, but the article does not specify which decision tree model is used.

Umak and Mishra (2014) used the multi-swarm particle swarm optimization (MSPSO) method to reduce the dimension of the intrusion detection data set NSL-KDD and then classified it using C4.5. The results show that feature extraction technology can improve accuracy and shorten the training time, which is of great significance for real-time monitoring of intrusion detection. The experiment did not use the test set for testing but the hold-out method, so the overall accuracy was higher.

Mahbooba et al. (2021) did not focus on the classification ability of intrusion detection but proposed the concept of explainable artificial intelligence to enhance trust management that experts can understand, such as underlying data evidence and causal reasoning. Since the models constructed by most artificial intelligence algorithms are black boxes, and the decision tree structure is simple and easy to grasp, the literature uses the decision tree algorithm as the research object. The importance of features based on entropy measurement for intrusion detection is studied. The intrusion classification rules extracted by the decision tree method are explained, which provides a reference for subsequent research on interpreting more complex deep learning models.

In recent years, the decision tree has not appeared in the literature as the primary research object but as a comparison algorithm or to optimize other algorithms. The research of decision trees in the field of intrusion detection is mainly combined with other algorithms to improve the detection rate, so there is little room for development in the future.

2.3.2. Support vector machine learning model

Support Vector Machine (SVM) is a binary classification algorithm developed from the generalized portrait algorithm (Hearst et al., 1998). It took nearly 40 years from its emergence to the more mature modern SVM concept proposed by Vapnik et al. SVM has not only a good classification effect but also good portability, stability, and robustness (Pan et al., 2020). It is a binary linear classifier. In its unimproved form, its ability to deal with large-scale data and multi-classification problems is weak (Chauhan et al., 2019), so improving the SVM algorithm has become the focus of research.

(1) Research on SVM intrusion detection based on data optimization.

Although the detection effect of SVM is excellent, in the case of a large amount of data and high dimension, its training time is longer than other algorithms. Therefore, many scholars have optimized the dataset to better use SVM classification. Meddeb et al. (2019) collected and built a database NetBigData (DoS attack) for in-vehicle ad hoc networks and achieved high-precision intrusion detection for ad hoc networks. Al-Qatf et al. (2018) proposed the SAE-SVM intrusion detection model, which used the SAE algorithm to reconstruct the data set so that the new data set created could have good characteristics. Shen et al. (2020) proposed the FRST-SVM model, which used fuzzy rough set theory to extract features from data. The SAE-SVM and FRST-SVM models achieve high-precision and rapid recognition, reducing the system burden.

SVM can not only detect intrusion data but also be applied to intrusion detection as a data processing algorithm. The intrusion detection model proposed by Wang and Jin (2020) uses SVM to initially classify the data set and then uses the classification result as the input of the density-based clustering algorithm for sub-classification. Although the experimental results are good, the model is complex and proposed based on a small dataset. When the amount of data is enormous, it will expose the drawbacks of the long training time of SVM, so it is not universal.

(2) Research on incremental SVM intrusion detection.

The principle of the SVM algorithm classification is to find the optimal hyperplane and classify the data according to the optimal
hyperplane. The construction of the optimal hyperplane does not require all support vectors and can be drawn only according to the support vectors of the edges. When new data is judged as an edge vector, it can be classified to achieve incremental learning.

To better detect dynamic data, many scholars have improved the traditional SVM algorithm. For example, the CSVAC model (Feng et al., 2014) incorporates a clustering based on a self-organizing ant colony training network (CSOACN) algorithm based on active support vector machines. This model has the advantages of SVM and CSOACN at the same time. SVM has a better classification effect. CSOACN can generate an adaptive model to train new category data and realize incremental learning based on fast detection.

Kabir et al. (2018) proposed an optimal allocation-based least squares support vector machine (OA-LS-SVM) intrusion detection model for new data detection. The model uses the idea of sampling in statistics and can extract representative samples from the old and new data sets as the input of the detection algorithm least squares support vector machine (LS-SVM). The experimental results show that this model does not reduce the generalization ability of LS-SVM while implementing incremental learning. At the same time, the training time is significantly reduced due to the use of sampling ideas.

Pozi et al. (2015) defined the missing and newly added attack data in the training set as unusually rare attacks. To improve the detection rate of such attacks, the authors proposed an intrusion detection model combining genetic programming and an SVM algorithm. Many experiments in the literature can prove that the detection rate does not decrease significantly when rare abnormal attacks occur. This also shows that the algorithm has good generalization and can be used to discover new types of network attacks.

It can be known from the above literature that the structure of the improved intrusion detection system is complex, and the support vector machine has difficulty dealing with large-scale data. This makes the enhanced system inefficient in solving the intrusion detection problem, and the optimization and improvement of the model still need further research. Although a support vector machine has many problems, it has strong portability and high stability, and it is easy to realize algorithm improvement research. It has always been the mainstream algorithm to solve the problem of intrusion detection. Therefore, learning support vector machines is an essential process for gaining a deep understanding of intrusion detection problems.

2.3.3. Naive Bayes learning model

The Naive Bayes algorithm is derived from Bayes' theorem and has a solid mathematical foundation. It adds to the Bayesian approach the assumption that different attributes are independent of each other. But this assumption does not hold in reality, and it will inevitably affect the results of the Naive Bayes classification (Zhu and Hu, 2015). However, compared to the higher complexity of the Bayesian algorithm, the impact of Naive Bayes on the results is acceptable. The principle of the Naive Bayes classification algorithm is simply to calculate the posterior probability according to the prior probability and then judge which category it belongs to according to the posterior probability. Today, Naive Bayes algorithms have been applied in various fields, such as pattern recognition (Koch et al., 2019), spam classification (Zhang et al., 2021), intrusion detection (Yao et al., 2015), etc.

(1) Research on intrusion detection based on Naive Bayes classification.

Yao et al. (2015) proposed an intrusion detection model based on decision trees and Naive Bayes. This model is an ensemble algorithm. After the data is processed, the decision tree classifier and the Naive Bayes classifier are used for detection, respectively. Then [...] advantage of this model is that it relieves the constraints of Naive Bayes conditional independence and avoids the problem of overfitting the decision tree.

Wang et al. (2014) proposed an improved Naive Bayes algorithm for the problem of accuracy degradation caused by different feature attribute sample nodes. This algorithm combines the original model with a classification control parameter, which is adjusted to select the optimal value and improve detection accuracy. The experiment uses the training set and test set provided by KDD99 to achieve multi-class detection, and the detection results are above 97%. It can be seen that the model has strong generalization while improving the accuracy.

Compared with other classification algorithms, the detection accuracy of the Naive Bayes algorithm is not optimal, but it can achieve fast classification. Zhang et al. (2018) took full advantage of the fast classification speed of the Naive Bayes algorithm. At the same time, to further shorten the training time and improve the accuracy, the authors first used IPCA to preprocess the KDD99 data set and then used the Gaussian Naive Bayes algorithm for detection. From the experimental results, when the detection accuracy is 91%, it only takes 0.562 s to complete the training model and test data. This result is of great significance for the real-time monitoring research of intrusion detection.

(2) Research on intrusion detection based on Naive Bayes feature extraction.

The Naive Bayes algorithm can not only be used as a classifier in intrusion detection research but also can be used to process intrusion data. Gu and Lu (2021) proposed a Naive Bayes algorithm utilizing the idea of feature embedding to deal with the problem of similar characteristics of normal and abnormal data and convert low-quality data into high-quality data with prominent feature categories that are easy to classify. It is convenient for the classifier to detect.

The research on traditional machine learning algorithms such as Naive Bayes and decision trees in intrusion detection has entered a mature stage very early. In recent years, these algorithms have often appeared in the literature as part of data processing.

2.4. Application of ensemble learning in the field of intrusion detection

In machine learning, different algorithms have different advantages and disadvantages. To avoid the algorithm's preference when dealing with different data sets, different weak learners can be combined to form a strong learner, that is, an ensemble learning algorithm. There are many kinds of ensemble learning models. This paper mainly introduces the application of random forest, XGBoost, and their improved algorithms in intrusion detection, which is summarized in Table 2.

2.4.1. Random forest

In 2001, Breiman (2001) proposed a machine learning algorithm, Random Forest, that fuses Bagging ensemble learning theory and the random subspace method. Multiple decision trees are trained by Bagging ensemble learning technology and combined into random forests. When dealing with data classification problems, the classification results are determined by the mode of the results obtained by all decision trees.

(1) Based on the application of improved random forest in the field of intrusion detection.

The classification model (T-SNERF) proposed by Hammad et al. (2021), combined with t-distributed stochastic neighbor embedding and random forest algorithm, achieves
the two results are evaluated to obtain the final result. The advan- 100% recognition accuracy on the UNSW-NB15 intrusion detec-

C. Zhang, D. Jia, L. Wang et al. Computers & Security 121 (2022) 102861

Table 2
Summary of intrusion detection based on ensemble learning.

Paper | Proposed Method | Goal/Success | Model
Hammad et al. (2021) | A T-SNERF intrusion detection model. | Accuracy of 100%, 99.7878%, and 99.7044% on UNSW-NB15, CICIDS-2017, and Phishing datasets. | Random Forest
Iwendi et al. (2020) | Classification method using CFS + ensemble algorithm. | The accuracy rate of multiple groups of experiments is above 99%. |
Boahen et al. (2021) | Network anomaly detection in a controlled environment based on an enhanced PSOGSARFC. | Accuracy rates of 98.96% and 98.56% on UNSW-NB15 and NSL-KDD datasets. |
Morfino and Rampone (2020) | Compares multiple machine learning algorithms used in IoT network intrusion detection systems. | Random forest achieves 100% accuracy on the SYN-DOS attack dataset. |
Amouri et al. (2020) | Process the data collected by the sniffer from the MAC layer and the network layer using the random forest algorithm. | Realize network intrusion detection based on the Internet of Things. |
Karthik and Krishnan (2021) | Hybrid random forest and synthetic minority over-sampling technique for detecting Internet of Things attacks. | Through comparative experiments, it is shown that the RF-SMOTE model performs well in the field of IoT network intrusion detection. |
Bhattacharya et al. (2020) | A Novel PCA-Firefly based XGBoost classification model for intrusion detection in networks. | Through multiple sets of comparative experiments, it is proved that the XGBoost-PCA and Firefly model performs well. | XGBoost
Wang and Lu (2020) | Intrusion detection model based on the coupling of XGBoost and LSTM model. | XGBoost-LSTM models outperform XGBoost or LSTM models alone. |
Bedi et al. (2021) | An improved Siam-IDS for handling class imbalance in Network-based Intrusion Detection Systems. | The model can deal with the class imbalance problem and comprehensively obtain the attack types in the intrusion detection data set. |
Qiao et al. (2022) | An XGBoost-RF intrusion detection model. | The model has obvious advantages in solving the problem of imbalanced data. |
Kumar et al. (2021) | An ensemble learning and fog-cloud architecture-driven cyber-attack detection framework for IoMT networks. | The model implements fog computing-based intrusion detection, with a detection rate of 99.98% on the ToN-IoT dataset. |
Xu et al. (2020) | An XGBoost Intrusion Detection Scheme Based on Privacy Protection. | On the premise that the detection results are not greatly reduced, privacy protection can be achieved. |
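
As a rough illustration of the voting rule described in Section 2.4.1, where bagged trees each see a random part of the feature space and the predicted class is the mode of their outputs, the following is a minimal Python sketch. It is not taken from any of the papers in Table 2: the one-split fit_stump stand-in for a decision tree, the function names, the binary labels, and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """One-split stand-in for a decision tree (binary labels only, an assumption)."""
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row[feature])
    if len(by_class) < 2:
        # The bootstrap sample held a single class: always predict it.
        only = next(iter(by_class))
        return feature, float("inf"), only, only
    means = {y: sum(v) / len(v) for y, v in by_class.items()}
    low, high = sorted(means, key=means.get)
    # Split halfway between the two class means on this feature.
    threshold = (means[low] + means[high]) / 2
    return feature, threshold, low, high

def predict_stump(stump, row):
    feature, threshold, low, high = stump
    return low if row[feature] <= threshold else high

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample (with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random subspace: each tree is restricted to one random feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    votes = [predict_stump(stump, row) for stump in forest]
    # The forest's answer is the mode of the individual tree outputs.
    return Counter(votes).most_common(1)[0][0]
```

A real random forest grows full decision trees and samples a feature subset at every split; the sketch keeps only the two sources of randomness and the majority vote.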

tion dataset. It also achieves 99.7878% and 99.7044% on the CICIDS-2017 and Phishing datasets, respectively.
Iwendi et al. (2020) proposed using CFS to reduce the dimensionality of the dataset and then experimented with Bagging Random Forest and Adaboost Random Forest. The results show that both models achieve over 99% accuracy on the KDD99 and NSL-KDD datasets.
Boahen et al. (2021) proposed the PSOGSARF intrusion detection model. The dataset was processed using PSOGSA and detected using the random forest algorithm.
These experimental results demonstrate that the improved accuracy and reduced training time come from extensive dataset processing before classification. Random forest is an ensemble algorithm based on decision trees. In the face of unseen data, random forest classifiers generally outperform decision trees (Nazir and Khan, 2020). Apart from parameter changes, few scholars have improved the structure of random forests in recent years, and most improvements have been achieved through algorithm fusion. At the same time, because of its excellent classification ability, it has been widely used in various fields of intrusion detection research.
(2) Application of random forest in the field of IoT intrusion detection.
In recent years, with the development of 5G technology, many devices can be controlled through wireless networks, realizing the interconnection of everything, which is the origin of the concept of the Internet of Things. With its continuous development, the Internet of Things has now grown into many fields, such as industrial control systems (Mokhtari et al., 2021), uncrewed vehicles (Ghaleb et al., 2020), smart grids (Upadhyay et al., 2021), cloud security (Mishra et al., 2020), etc. As a new network paradigm, the distributed nature of the Internet of Things threatens its security significantly.
Because of its complexity, randomness, accuracy, and other characteristics, random forest is suitable as a core algorithm for the Internet of Things. Morfino and Rampone (2020) compared several machine learning algorithms used in the Internet of Things network intrusion detection system, using intrusion data targeted at the Internet of Things, namely the SYN-DOS attack. All supervised machine learning algorithms used in the experiments are in Apache Spark's MLlib library, and experiments are performed in a cloud environment. The experimental results show that the random forest has the best effect, the accuracy rate reaches 100%, and the time required for training and testing in the cloud environment is significantly reduced.
The Internet of Things is a newer network paradigm. The operating systems used by many Internet of Things devices are relatively lightweight, which makes it impossible to install and use large and complex antivirus software. Therefore, it is necessary to develop an intrusion detection system that can be applied to the Internet of Things. Amouri et al. (2020) proposed a cross-layer IoT network intrusion detection method, using a random forest algorithm to process the data collected by the sniffer from the MAC layer and the network layer and send them centrally to the super node. Karthik and Krishnan (2021) proposed an IoT network attack detection model based on random forest and the synthetic minority oversampling technique (SMOTE). The authors used the NSL-KDD and N-BaIoT datasets and conducted experiments with SVM, decision tree, and random forest. The results show that the RF-SMOTE model has the best detection effect.

2.4.2. XGBoost
The full name of XGBoost (Karthikraja et al., 2022) is eXtreme Gradient Boosting, an optimized distributed boosting library with high efficiency, flexibility, and convenience, which was summarized and proposed by Chen based on previous research. It implements machine learning algorithms under the gradient boosting framework. During gradient boosting, each subsequent tree is constructed sequentially to minimize the error of the previous tree (Zhang et al., 2020).
(1) Intrusion detection system based on improved XGBoost algorithm.
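
As background for the XGBoost-based systems, the sequential idea described in Section 2.4.2, where each new tree is fitted to the error left by the trees before it, can be sketched in a few lines of Python. This is plain gradient boosting with squared loss on one-dimensional inputs, not XGBoost itself, which additionally uses regularization and second-order gradient information; the helper names, the learning rate, and the one-split trees are assumptions made for illustration.

```python
def fit_residual_stump(xs, residuals):
    """Least-squares one-split regressor on 1-D inputs (needs >= 2 distinct xs)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_residual_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        # Track the training mean squared error after each round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history
```

The returned history makes the point of the paragraph visible: the training error does not increase from one round to the next, because every stump corrects what the previous ones left unexplained.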


The XGBoost algorithm can prevent overfitting and improve accuracy and efficiency in intrusion detection research. However, with the development of the Internet and the advent of the era of big data and the Internet of Things, network security is facing more and more new technical challenges. Based on the XGBoost algorithm, researchers have made improvements to deal with various practical problems.
Facing the problem of low detection efficiency due to the vast data sets of network intrusion detection, an XGBoost intrusion detection model based on the PCA-FA algorithm was proposed by Bhattacharya et al. (2020). The data is encoded, normalized, and dimensionally reduced before the XGBoost classification algorithm runs. The experimental results show that, compared with other algorithms, the XGBoost-PCA and Firefly model is optimal.
As mentioned above, the Internet of Things uses lightweight operating systems with limited computing resources on devices. Facing this dilemma, Wang and Lu (2020) proposed a host-based intrusion detection model, which couples XGBoost with the LSTM model. The data set is the system call sequence collected from the device. The reason for using LSTM is that it is suitable for processing sequence data with long-term dependencies. Experiments show that the XGBoost-LSTM stacking model is superior to the XGBoost or LSTM model alone.
Network traffic data is enormous and mostly normal; only a small share is malicious activity. Models trained on such imbalanced datasets tend to produce degraded detection results. To solve this problem, some scholars raised the proportion of minority traffic in the total data. Although the detection results are improved, the efficiency of the model is not. Bedi et al. (2021) proposed an intrusion detection system based on a two-layer ensemble model (I-SiamIDS). The first layer integrates b-XGBoost, a Siamese neural network, and a deep neural network to identify attack data. The attack data obtained in the first layer is input into the m-XGBoost model of the second layer, and finally, a more comprehensive set of attack types is obtained.
(2) Intrusion detection system combined with XGBoost and random forest.
Although both XGBoost and random forest belong to ensemble learning, with the development of the network, an intrusion detection model constructed from a single simple ensemble algorithm has difficulty dealing with complex network attacks. Research has found that fusing or integrating different ensemble learning algorithms into new models often gives better results. The following introduces several intrusion detection models that combine random forests and XGBoost.
Qiao et al. (2022) proposed the XGBoost-RF model to deal with imbalanced data. First, the XGBoost algorithm is used to score the relatively important features in the data set, and then the random forest algorithm with updated weights is used for attack detection. The advantage of this model is that through feature selection and weight adjustment, the small amount of important data can be fully exploited in training, as shown in Fig. 6. Compared with the two-layer detection model (I-SiamIDS) of Bedi et al. (2021), this model performs better.
Kumar et al. (2021) proposed an intrusion detection model based on fog computing to deal with IoT network security issues. The model structure is divided into two layers. Decision trees, Naive Bayes, and random forests are used as the first-level learners, and then XGBoost is used for centralized classification. The authors conduct a large number of experimental comparisons with other algorithms. The experimental results show that the model is superior to other algorithms, and the optimal results are all above 99%. The above examples show that intrusion detection models combining random forests and XGBoost have obvious advantages compared with other algorithms.
(3) Privacy-preserving intrusion detection system based on XGBoost.
With the expansion of the network scale, there are more and more types and quantities of network intrusion detection data. Training with a single source of intrusion detection data is prone to overfitting. To this end, an internationally renowned security agency proposed the Managed Detection and Response (MDR) scheme, which trains the detection model by uploading data from multiple sources to the cloud. However, this scheme has problems such as high computational cost and extended training time. At the same time, the development of the Internet has brought new challenges to privacy protection. The US Prism program and revelations that some Chinese companies collected user information through apps and used it for profit have made people pay more and more attention to privacy protection.
To solve the privacy problem, an intrusion detection scheme based on encrypted XGBoost was proposed by Xu et al. (2020). The final model for detecting target data is obtained by integrating the encrypted detection models uploaded to the cloud. This scheme not only achieves privacy protection but also dramatically reduces the training time at the cost of a slight decrease in the detection results. However, the future will be the era of supercomputing, especially with the acceleration of cloud computing. The intrusion detection scheme based on encrypted XGBoost still needs to be improved in terms of detection rate. Otherwise, even if privacy protection is achieved, it will still be eliminated if there is little difference in training time.

2.5. The application of deep learning in the field of intrusion detection

As the latest branch of machine learning, deep learning has achieved great success in recent years, especially in the areas of speech recognition, image classification, big data, etc. It solves more practical problems while maintaining high predictive ability and accuracy. This paper mainly summarizes the application of CNN and RNN in intrusion detection, outlined in Table 3.

2.5.1. Research on intrusion detection based on CNN
In recent years, researchers have used CNN algorithms for intrusion detection research in various fields, such as the Internet of Things (Abu Al-Haija and Zein-Sabatto, 2020), industrial control systems (Zhou et al., 2021), connected autonomous vehicles (van Wyk et al., 2020; Nie et al., 2020), and in-vehicle self-organizing networks (Jeong et al., 2021), and achieved relatively good results. CNN has a strong recognition ability mainly because its convolution layer has a strong ability to extract features from data. Therefore, when improving algorithms through fusion, many scholars first use CNN to extract features from the data and then use other classifiers to make decisions. For example, in the hybrid intrusion detection framework based on deep learning proposed by Khan (2021), the role of the convolutional neural network is to obtain local features through convolution. Riyaz and Ganapathy (2020) used the linear correlation coefficient and conditional random field to obtain data features and then used convolution to extract the optimal features in the feature data for classification.
(1) Data preprocessing based on CNN intrusion detection.
As the most commonly used and fastest-growing algorithm in deep learning, the convolutional neural network is usually used to process two-dimensional or multi-dimensional data such as pictures and speech. It is often inferior to other machine learning algorithms for one-dimensional feature data sets. Faced with this problem, researchers transform the network intrusion detection data stream into two-dimensional data to train the CNN model


Fig. 6. I-SiamIDS and XGBoost-RF model experiment comparison chart.

Table 3
Summary of intrusion detection based on deep learning.

Paper | Proposed Method | Goal/Success | Model
Li et al. (2020) | Convert intrusion detection data to grayscale. | Implementation of using CNN to process one-dimensional intrusion detection data. | CNN
He et al. (2021) | Use CNN to pre-train on a small number of abnormal behavior samples. | For unbalanced data, the detection accuracy is effectively improved. |
Zhang et al. (2020) | An intrusion detection model based on SGM-CNN. | Through multiple sets of comparative experiments, it is proved that the SGM-CNN model performs well for imbalanced data problems. |
Mulyanto et al. (2021) | A network intrusion detection system based on focal loss. | Reducing the weight of easy-to-classify samples effectively improves the detection rate of minority samples. |
Xiao et al. (2020) | Intrusion detection model based on multi-kernel CNN. | The use of controlled units in the convolutional layer has experimentally proved that incremental learning can be better achieved. |
Oliveira et al. (2021) | The data is reorganized in chronological order and detected using LSTM. | On the dataset CIDDS-001, the accuracy reaches 99.94%, and the f1-score is 91.66%. | LSTM
Xie et al. (2020) | An HTTP-based Trojan Detection Model via the Hierarchical Spatio-Temporal Features of Traffics. | In detecting HTTP Trojans, HSTF-Model is significantly higher than other algorithms, and the accuracy reaches 99.98% and 91.41% on the BTHT-2018 and ISCX-2012 datasets. |
Hsu et al. (2021) | Robust network intrusion detection scheme using long-short term memory based convolutional neural networks. | 99.68% accuracy on NSL-KDD dataset. |
Yao et al. (2021) | A cross-layer feature-fusion CNN-LSTM-based approach. | Compared with the CNN and LSTM models, the loss of feature data is avoided, and the accuracy is 99.95% and 99.79% on the KDD Cup 99 and NSL-KDD datasets. |
Sun et al. (2020) | Extracting features using CNN-LSTM hybrid network for intrusion detection system. | The accuracy is 98.67% on the CICIDS2017 dataset. |
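
Table 3 lists a loss-based approach to imbalanced data (Mulyanto et al., 2021) whose stated effect is to reduce the weight of easy-to-classify samples. A minimal Python sketch of how that down-weighting works, assuming the widely used binary form of focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), is given here. The default values gamma = 2 and alpha = 0.25 and the function names are assumptions for illustration; the source does not state which settings the cited paper uses.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one sample.

    p: predicted probability of the positive (attack) class, 0 < p < 1
    y: true label, 1 for attack and 0 for normal
    """
    p_t = p if y == 1 else 1.0 - p   # probability given to the true class
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample (gamma and alpha are assumed defaults)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks toward 0 for confident, correct predictions,
    # so easy samples contribute little and hard samples dominate training.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def down_weight(p, y, gamma=2.0, alpha=0.25):
    """Ratio of focal loss to cross-entropy: the effective weight of a sample."""
    return focal_loss(p, y, gamma, alpha) / cross_entropy(p, y)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is a convenient sanity check when experimenting with the two parameters.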

(Andresini et al., 2021). For example, Li et al. (2020) divided the intrusion detection data into four parts according to correlation, then transformed the four parts into grayscale images and fed them into a multi-convolutional neural network for intrusion detection research.
(2) Research on CNN intrusion detection based on unbalanced data.
As Internet access and connections increase, so do the types of cyberattacks. Analysis of various intrusion detection data shows a significant gap in the number of samples across different kinds of attacks, and the numbers of normal and abnormal samples also differ. Therefore, intrusion detection data has a data imbalance problem. This problem is one of the essential reasons it is difficult for intrusion detection models to improve their performance. Some scholars have explored the use of CNN to solve the problem of data imbalance.
He et al. (2021) used an autoencoder to select normal samples to train an anomaly detection model. To cope with imbalanced data, CNN is used to pre-train on a small number of abnormal behavior samples. The extracted features are input into the autoencoder for training, which improves the accuracy of anomaly detection compared to using only the autoencoder to train on normal samples. In this model, CNN is not the final classifier but plays an auxiliary role.
CNN can also be used as a classifier in the intrusion detection model to deal with the problem of imbalanced data. Unbalanced data can be dealt with in two ways: data-level technology and algorithm-level technology (Khan et al., 2018). The data-level technology balances the data before detection. The specific process is to under-sample the majority class and over-sample the minority class.
For example, Zhang et al. (2020) proposed a CNN intrusion detection model, SGM-CNN, that combines SMOTE with a Gaussian mixture model-based under-sampling technique for imbalanced data. To test the model's performance, the authors compared different classifiers and data-level balanced sampling techniques on multiple data sets. The experiments show that the SGM-CNN model is the best overall. Algorithm-level techniques do not require balanced preprocessing of the data and focus on a small number of


misclassified groups by directly modifying the weights during the learning process.
Mulyanto et al. (2021) proposed a network intrusion detection system based on focal loss, using the focal loss function to train the CNN model, reducing the weight of easy-to-classify samples, balancing the data categories, and effectively improving the detection rate of minority samples.
The detection model based on data-level technology relies heavily on the data processing capability in the early stage. In contrast, the CNN model based on algorithm-level technology does not require balanced data preprocessing, but its time cost increases compared with the data-level detection model. There are still many solutions to the problem of data imbalance, which need to be improved and innovated by more scholars.
(3) Research on incremental CNN intrusion detection.
Conventional network intrusion detection systems are usually preconfigured to detect malicious network attacks. Today, attackers have gone deeper and can try to circumvent common detection rules (Satheesh Kumar et al., 2022). Therefore, discovering new types of attacks has become one of the hotspots and difficulties in future research.
Compared with other machine learning algorithms, it is more difficult for CNN to achieve incremental learning. The introduction of large amounts of new data can make CNN suffer from catastrophic forgetting. Some scholars have improved the structure (Li and Hoiem, 2018). As shown in Fig. 7, the part marked in red represents the dedicated output layer that grows as the number of data categories increases. Based on this network, there are many different training methods to achieve incremental learning. But these methods either have difficulty balancing model learning rates for new and old classes or take a long time to train with all the data. As the number of types increases, the structure grows larger and the training becomes more inefficient, which may eventually make the system collapse.

Fig. 7. Common CNN incremental learning models (modified figure from (Li and Hoiem, 2018)).

The intrusion detection model based on a multi-kernel convolutional neural network proposed by Xiao et al. (2020) is an excellent method. The core of the algorithm is to provide controlled units in the convolutional layer, which can learn new data while remembering the original categories. However, this model is still fine-tuned on the original network, which will affect the classification of the original categories. At the same time, to better train new categories, data balance processing is required before model training.
Although the current effect of CNN on incremental learning is not good, this does not mean it is difficult for deep learning models to achieve incremental learning. Alavizadeh et al. (2022) introduced a reinforcement learning method for network intrusion detection based on deep Q-learning, which detects different types of network intrusions through automatic trial and error methods to achieve the purpose of incremental learning.
The effect of CNN on incremental learning is not as good as that of most machine learning algorithms. However, it still has excellent detection capabilities and development prospects in intrusion detection.

2.5.2. Research on intrusion detection based on RNN

Given the nature of the data, RNN is more suitable for handling intrusion detection problems than CNN. It is often used to process sequence data. For example, Kwon et al. (2020) used an RNN algorithm to address abnormal commands in the power system. Because command data flows differ in length, the paper proposes a bidirectional recurrent neural network to avoid ignoring long-period commands. It is worth noting that the bidirectional recurrent neural network algorithm is not used here to detect abnormal behavior but to determine when the next correct command comes.
(1) Research on intrusion detection based on LSTM.
The LSTM algorithm is an improvement on the simple RNN algorithm. It uses a gating mechanism consisting of an input gate, a forget gate, and an output gate, which solves the long-term dependence problem of the simple RNN. However, LSTM still belongs to the RNN family, so the RNN algorithm mentioned in many documents is actually the LSTM algorithm. The LSTM algorithm has achieved outstanding results in intrusion detection research.
Oliveira et al. (2021) compared the LSTM algorithm with random forests and the multi-layer perceptron on the dataset CIDDS-001. LSTM achieved better results, with an accuracy of 99.94% and an f1-score of 91.66%. However, the data preprocessing in that paper, which reorganizes the data and sorts it in time order, favors the LSTM algorithm, and this is one of the reasons why the LSTM experimental results are better. But this does not negate the advantages of the LSTM algorithm for processing intrusion detection data, especially when there are temporal features in the data.
(2) Research on intrusion detection based on LSTM and CNN.
In recent years, there have been many studies using deep learning to improve network intrusion detection systems. Different deep learning algorithms have different advantages and characteristics. Therefore, many combined deep learning algorithms have emerged, among which combinations of CNN and LSTM are the most numerous. Because the convolutional layer can extract the spatial features of the data well, and the LSTM can extract the time-series features of the data, the combination of the two will significantly improve the detection accuracy. Fig. 8 shows the four CNN and LSTM combination algorithm models in the literature (Xie et al., 2020; Hsu et al., 2021; Yao et al., 2021; Sun et al., 2020). Except for Fig. 8(c), which is combined in parallel, the other models connect CNN and LSTM in series.
The entries for Xie et al. (2020), Hsu et al. (2021), Yao et al. (2021), and Sun et al. (2020) in Table 3 show the specific information of several CNN-LSTM algorithms. Among them, HSTF-Model is a detection model for HTTP Trojans. In its comparative experiments, the accuracy of the HSTF-Model can reach 99.98%, while the detection accuracy of other algorithms is low, in some cases even 0%. The experimental results show that, compared with other models, this model is more suitable for handling HTTP Trojan detection problems, and the model has a strong generalization ability.
It can be seen from Fig. 8 that the CNN-LSTM (b) is similar in structure to the DL-IDS (d) model. However, deep learning models have many hyperparameters, and even if both connect CNN and LSTM, their structures and running results will differ.
The cross-layer feature fusion CNN-LSTM model proposed by Yao et al. (2021) avoids losing feature data compared to the single-model, serial CNN-LSTM. When using the NSL-KDD dataset for de-


Fig. 8. Intrusion detection model based on CNN and LSTM.

tection, the cross-layer feature fusion CNN-LSTM model outperforms the CNN-LSTM model.

3. Experiment and comparative analysis

The above summarizes the development of various machine learning algorithms in intrusion detection in the past ten years. It can be seen that the related research mainly focuses on three aspects, as shown in Table 4.
This paper introduces the research progress in intrusion detection and conducts comparative experiments. The most important purpose of the experiment is to compare the detection performance of machine learning algorithms in the field of intrusion detection. Most of the models in other literature are improved models, and less attention is paid to the performance of the original model. Even if the models are the same, the differences in data and data processing methods will have a more significant impact on the detection results, so this paper provides a comparative experiment.

Table 4
Research direction in the field of intrusion detection.

Direction of development | Specific contents
Model improvements | Ensemble, fusion
Data processing | Feature extraction, dimensionality reduction, unbalanced data processing, data dimension transformation
Application field | Internet of Things, industrial control systems, connected cars, in-vehicle ad hoc networks

3.1. Dataset introduction

The experiments use KDD 99 (KDD Cup 1999 Data October 28, 1999) and its improved NSL-KDD dataset (Tavallaee et al., 2009). These two datasets are selected mainly because they are used by


most of the literature and are often used to examine the detection performance of models, which makes them authoritative benchmarks.
The KDD 99 dataset was created by simulating a US Air Force LAN environment and was later used publicly in the KDD Cup competition. It includes 5 million network connection records. The experiment uses the 10% training subset it provides (more than 490,000 instances) and the corrected test samples with labels (more than 310,000 instances). Labels are normal and abnormal, and abnormal can be divided into four categories, with 39 attack types. Since the 10% training samples contain only 22 attack types while the test samples contain 39, neither the hold-out method nor the cross-validation method on the training samples can adequately reflect the performance of the algorithm.
The NSL-KDD dataset removes redundant and duplicate records from KDD99, which can better reflect the detection ability of the algorithm, but it cannot reflect a real network. The training set contains more than 120,000 network connection records, and the test set contains more than 20,000.

3.2. Evaluation indicators

Before introducing the evaluation indicators, first consider TP, TN, FP, and FN, as shown in Table 5.
TN: a normal sample correctly predicted as normal.
FP: a normal sample mispredicted as abnormal.
FN: an abnormal sample mispredicted as normal.
TP: an abnormal sample correctly predicted as abnormal.

Table 5
Confusion matrix.

Actual \ Predict | normal | abnormal
normal | TN | FP
abnormal | FN | TP

The experiments select accuracy, precision, recall, F1, AUC, and algorithm running time as evaluation indicators.
In intrusion detection, detection accuracy is the percentage of correctly identified normal and abnormal data among all detected data, which represents the overall detection level of the algorithm model. Because it is convenient to calculate, it is the most commonly used index for comparing algorithm models. It is calculated as follows:

The above evaluation indicators are the most commonly used for evaluating algorithm models, but they still have shortcomings. For example, in the case of unbalanced data, these evaluation indicators cannot reflect the predictive ability of the model well.
Therefore, a new evaluation index, AUC, was added to the experiment. AUC is the area enclosed by the ROC curve and the coordinate axis. For a classifier no worse than random guessing, its value lies in [0.5, 1]. The closer to 1, the better the model effect. Since AUC is largely insensitive to unbalanced data, it can better reflect the overall detection level of the model.
In addition to examining the detection ability of the model, the training and testing time of the model also needs to be considered. When the detection ability of models is similar, the shorter the time, the better. Short running time is of great significance for real-time monitoring in intrusion detection.

3.3. Data preprocessing

The KDD99 and NSL-KDD datasets contain character data, and there is a sizeable numerical gap between features due to different measurement units. Therefore, it is necessary to convert the character data to numerical values and normalize the data.
Due to the high data dimension, experiments require more time and space and are inefficient, so Principal Component Analysis (PCA) is used for dimensionality reduction. To find the appropriate dimension, the random forest was used to conduct experiments on the KDD99 and NSL-KDD data (41 dimensions) and various dimension-reduced data. The experimental results are shown in Tables 6 and 7. By comparison, it is found that the detection effect is the best when the KDD99 feature dimension is 12, and the detection effect is the best when the NSL-KDD feature dimension is 10. To make the experiment more rigorous, algorithms such as Naive Bayes and decision trees are also used for testing, and the data dimensions are determined after comparison.

3.4. Experimental process and analysis

The operating system of the experimental platform is Windows 10, the CPU frequency is 2.60GHz, the memory is 8GB, the 1T hard disk storage is empty, and the programming tool is PyCharm2019.3.3. Decision tree, SVM, Naive Bayes, random forest, and XGBoost in the sklearn library are used for experiments. The sequential model in the Keras library is used to construct CNN, RNN, and LSTM for experiments. The ex-
TP + TN perimental model is adjusted based on the default parameters. The
accuracy = specific parameters and structures are shown in Table 8. Use the
TP + TN + FP + FN
training and test set of the above KDD99 and NSL-KDD intrusion
Precision describes the percentage of correctly predicted data
detection data sets to conduct simulation experiments. The exper-
out of all data predicted to be abnormal behavior. The higher the
imental results are shown in Figs. 9–12.
precision rate, the lower the misjudgment rate of the algorithm
From an overall perspective, Fig. 9 shows the accuracy, F1, and
model for normal behavior data. Calculated as follows:
AUC of each model on the KDD99 dataset (because the SVM train-
TP
precision = ing time for 490,0 0 0 data is too long, and even the optimal hyper-
TP + FP plane cannot be found, the SVM model uses 10 0,0 0 0 data sets after
The recall rate is the percentage of all abnormal behavior data stratified sampling of 10% of the KDD99 data set). It can be seen
that are accurately predicted to be abnormal, which represents the that the classification effect of the Naive Bayes classifier is poor,
ability of the algorithm model to identify abnormal behavior data. and the indicators are the lowest. On the contrary, XGBoost has the
The higher the recall rate, the lower the model’s missed detection best overall performance, but the advantage compared with other
rate for abnormal data. machine learning algorithms is not obvious.
TP Fig. 10 shows the experimental results on the NSL-KDD dataset.
recall = Comparing Figs. 9 and 10, it can be found that the values of each
TP + FN
index in Fig. 10 are significantly reduced. Because the NSL-KDD
F1 is an evaluation index to measure the two-class model. Its
dataset is an improvement of the KDD99 dataset, it removes many
value is obtained by weighting the precision rate and the recall
redundant duplicate data so that the detection algorithm will not
rate, which represents the quality of the model to a certain extent.
be biased towards data with a high repetition rate. It can be seen
2 ∗ precision ∗ recall from Fig. 10 that the accuracy of the integrated learning algorithm
F1 = is lower than that of F1, but the AUC is the best. The difference
precision + recall
from before is that although the classification effect of Naive Bayes

C. Zhang, D. Jia, L. Wang et al. Computers & Security 121 (2022) 102861

Table 6
Experimental comparison of KDD99 feature dimension.

Dimension   41      25      20      15      12      11      10      9
Accuracy    92.7%   92.3%   92.5%   92.7%   92.8%   92.4%   92.5%   92.2%
Precision   99.9%   99.9%   99.9%   99.8%   99.9%   99.9%   99.8%   99.8%
Recall      91.1%   90.5%   90.8%   91.1%   91.1%   90.6%   90.8%   90.5%
AUC         0.966   0.983   0.963   0.964   0.966   0.966   0.962   0.957
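
Table 6 lists precision and recall but not F1. As a rough illustration (not part of the original experiments), the F1 definition from Section 3.2 can be applied to the rounded percentages in the table. The helper name `f1_from`, the dictionary layout, and the choice of three columns are assumptions made for this sketch.

```python
# Illustrative only: derive F1 from the rounded precision/recall in Table 6.
# The helper name and the dictionary layout are assumptions for this sketch.


def f1_from(precision: float, recall: float) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 6 for three feature dimensions.
table6 = {
    41: (0.999, 0.911),
    12: (0.999, 0.911),
    9: (0.998, 0.905),
}

f1_by_dim = {dim: f1_from(p, r) for dim, (p, r) in table6.items()}

for dim, f1 in f1_by_dim.items():
    print(f"dimension {dim}: F1 = {f1:.3f}")
```

Because the inputs are rounded to three significant digits, small differences in the third decimal of the derived F1 should not be over-interpreted.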

Fig. 9. Experimental results of KDD99 group.
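
Fig. 9 summarizes each model by accuracy, F1, and AUC. The sketch below shows one way these three numbers could be computed from labels and scores, using the confusion-matrix definitions of Section 3.2 and the pairwise-ranking reading of AUC (the probability that a random abnormal sample scores above a random normal one). The toy labels, scores, 0.5 threshold, and function names are invented for illustration; the paper does not state how its metrics were implemented.

```python
# Toy sketch: accuracy, F1, and AUC from labels (1 = abnormal, 0 = normal).
# All data and names here are hypothetical, not taken from the experiments.


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy_f1(y_true, y_pred):
    """Accuracy and F1 computed from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Fraction of (abnormal, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

acc, f1 = accuracy_f1(y_true, y_pred)
area = auc(y_true, scores)
print(acc, round(f1, 4), area)
```

The pairwise AUC loop is quadratic in the number of samples, so it is only meant to make the definition concrete; a rank-based implementation would be needed at the scale of KDD99.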

Table 7
Experimental comparison of NSL-KDD feature dimension.

Dimension   41      25      20      12      11      10      9
Accuracy    77.1%   75.3%   72.1%   62.6%   74.2%   74.3%   68.2%
Precision   96.9%   97.6%   94.4%   95.7%   96.2%   96.1%   95.6%
Recall      61.8%   58.0%   54.2%   36.0%   56.9%   57.1%   46.2%
AUC         0.901   0.871   0.893   0.867   0.886   0.891   0.866

is reduced, the decline is slight, and its overall ranking is high, which shows that a large amount of repeated data has less impact on Naive Bayes than on the other models.
To further study the performance of the algorithms, this paper splits F1 into precision and recall, as shown in Figs. 11 and 12. It can be seen from the two figures that precision is higher than recall. This is because these algorithms recognize the data they were trained on well, so there are few misjudgments among the data detected as abnormal activity, and the accuracy of those detections is high. The test set used in the experiment has 17 more attack types than the training set, so many malicious activities cannot be correctly identified, and recall is generally low. After redundant duplicate data is removed, the proportion of attacks in the test set that do not appear in the training set increases, so there is a big difference between the recall rates in the two figures. In Fig. 12, the recall of Naive Bayes is much higher than that of the other machine learning algorithms, but its precision is the lowest among all models. This indicates that, on this dataset, Naive Bayes identifies unknown abnormal behavior more easily than the other models but recognizes learned data poorly. In contrast, ensemble learning, such as random forest, has the highest accuracy but low recall, which means that it has a high recognition rate for learned data but a relatively poor recognition ability for unknown data. The overall performance of the deep learning models is not particularly outstanding, but their generalization is better than that of ensemble learning, and their accuracy is better than that of traditional machine learning. The CNN and LSTM models outperform RNN with similar accuracy values, but LSTM generalizes better.
Time is also one of the criteria for evaluating algorithms. On the NSL-KDD dataset, Naive Bayes is the fastest, taking less than 0.2 s, followed by the decision tree at 0.76 s; random forest and XGBoost take 5.9 s and 11.5 s respectively, and SVM is the slowest, taking 2965 s. The time taken for deep learning is related to the

Table 8
Experimental model parameters.

Model Parameters/Structures

Decision Tree   max_depth=2; criterion="entropy"
SVM             kernel='linear'; C=1.5; gamma='auto'; probability=True
Naive Bayes     default
Random Forest   n_estimators=39; max_depth=None; min_samples_split=2; random_state=0
XGBoost         min_child_weight=3; gamma=0.2; subsample=0.7; colsample_bytree=0.7; reg_lambda=2; seed=1000
CNN             A one-dimensional convolutional layer with 64 convolution kernels of length 3; a one-dimensional max pooling layer with a pooling window size of 2; a flattening layer; a fully connected layer with relu activation; a dropout layer with a rate of 0.5; a fully connected layer with sigmoid activation; the number of iterations is 20
RNN             A simple RNN layer with 4 neurons; a dropout layer with a rate of 0.1; a fully connected layer with sigmoid activation; the number of iterations is 15
LSTM            (one LSTM layer with 32 neurons; one dropout layer with a rate of 0.1) × 4; one fully connected layer with sigmoid activation; the number of iterations is 10
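
As a hedged sketch of how the scikit-learn rows of Table 8 might translate into code, the snippet below instantiates the four classical models with the listed parameters and fits them on synthetic data. Several points are assumptions rather than facts from the paper: `GaussianNB` is chosen because the table only says "default" and does not name the Naive Bayes variant, the data comes from `make_classification` instead of KDD99 or NSL-KDD, and the 150/50 split is arbitrary. XGBoost and the Keras models are left out so that the example depends on scikit-learn only.

```python
# Sketch under stated assumptions: Table 8 parameters applied to scikit-learn
# estimators, evaluated on synthetic data (not KDD99 / NSL-KDD).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=2, criterion="entropy"),
    "SVM": SVC(kernel="linear", C=1.5, gamma="auto", probability=True),
    # Assumption: the Naive Bayes variant is not stated in the paper.
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(
        n_estimators=39, max_depth=None, min_samples_split=2, random_state=0
    ),
}

# Synthetic stand-in with 12 features, echoing the reduced KDD99 dimension.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)  # mean accuracy

for name, value in test_accuracy.items():
    print(f"{name}: {value:.3f}")
```

The scores printed here say nothing about the paper's results; they only confirm that the parameter combinations in Table 8 are accepted by the estimators.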


Fig. 10. Experimental results of NSL-KDD group.

Fig. 11. Precision and recall of KDD99 groups.

Fig. 12. Precision and recall of NSL-KDD groups.


number of iterations. CNN takes 625 s for 20 iterations, RNN takes 321 s for 15 iterations, and LSTM takes 1026 s for 10 iterations.

3.5. Experiment summary

Taken together, the results show that ensemble learning works well for intrusion detection. Although its recall is the lowest on the NSL-KDD dataset, its overall indicator, AUC, is the best, and NSL-KDD does not conform to actual intrusion detection data. The deep learning algorithms are not particularly prominent in this experiment, but their results are affected by the structure, the hyperparameters, and the number of training iterations, which need further in-depth study. Although the decision tree performs well among the traditional machine learning algorithms, it is inferior to ensemble learning and is gradually being replaced by it. SVM has difficulty processing data of large orders of magnitude, so incremental learning needs to be studied for it. Although the Naive Bayes algorithm has low recognition accuracy for learned data, it has obvious advantages over the other models when faced with new types of attacks. It also trains quickly, so it can be used to detect new types of data.

4. Future research directions

The above analysis and comparison show that machine learning has achieved certain results in network intrusion detection, and many algorithm models have been proposed or put into use. However, current models still have many limitations and face many problems and challenges. Further research is needed, mainly including:
(1) How to preprocess when faced with different datasets.
There are many network intrusion detection datasets, and the preprocessing methods for different datasets also differ. For example, in the dimensionality reduction experiments in this paper, there are cases where the dimensions differ only slightly but the detection results differ considerably. NSL-KDD is a dataset optimized from KDD99. Even though the two datasets have the same dimensions and each dimension has the same meaning, the optimal reduced dimension is different. What's more, most network intrusion detection datasets have different dimensions and even different formats. Moreover, data preprocessing includes not only dimensionality reduction but also converting characters to numbers, data normalization, and so on, which makes it difficult to unify data preprocessing across different datasets. Therefore, constructing an algorithm model and system that can adaptively find the optimal data preprocessing method also needs further research.
(2) How to build a model that takes into account both small data sets and large data sets.
Algorithmic models such as neural networks and ensemble learning have demonstrated excellent detection capabilities on large datasets. For small datasets, there are also related algorithm studies. For example, Feng et al. (2014) proposed a detection model for small datasets, and the test results were excellent. However, the detection accuracy of models designed for large datasets decreases when they are faced with small datasets, and the training time of models designed for small datasets is significantly prolonged when they deal with large datasets. Moreover, because network intrusion detection datasets are large, there is little research on network intrusion detection based on small datasets, and models that consider both small and large datasets are even rarer. Therefore, further research is needed in this area.
(3) How to perform multi-classification for imbalanced data.
The problem of unbalanced data has always been a focus and a difficulty of network intrusion detection research. There are many types of attacks in the datasets, and their distribution is highly unbalanced. For example, among millions of records, some attack types have only a single-digit number of samples, which makes it difficult to train adequately on such a small amount of data. At present, the mainstream method is to increase the proportion of the minority data. Although this improves the detection results, the improvement is limited, and further improvements are still needed.
(4) How to complete incremental learning when faced with new categories of data.
At present, the machine learning algorithms that can achieve incremental learning are mainly SVM and ensemble learning, but the accuracy of these algorithms is not high. Some scholars have tried to achieve it with deep learning, but the results are unsatisfactory. How to identify a small number of new types of attack data and complete incremental learning has become a difficulty in current research.

5. Conclusions

This paper summarizes the application of and research on machine learning in network intrusion detection systems. Comparing and analyzing common machine learning algorithms used in intrusion detection in recent years clarifies the characteristics of the different algorithms. In this field, there have been few recent studies on traditional machine learning algorithms and more research on ensemble learning and deep learning models. Ensemble learning is maturing and is applied in various fields, while deep learning is still at the stage of exploring algorithm models. To further understand the relevant algorithms, this paper uses KDD99 and NSL-KDD as experimental data to conduct experimental research on them. The experimental results show that the overall effect of ensemble learning is better, but the other algorithms also have their own advantages. In future work, new models could be explored that combine the advantages of multiple algorithms as far as possible. The experiments in this paper use only KDD series data, so other data and preprocessing methods remain to be studied. The experiments perform binary classification of normal and abnormal data and do not perform multi-class detection. At the same time, when divided finely, there are many types of intrusion behavior, many new types keep appearing in practice, and the distribution of these types is highly uneven. Therefore, future work still needs to explore the detection capabilities of various machine learning algorithms on unbalanced and new types of data and to find or build suitable algorithms.

Declaration of Competing Interest

We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, and there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

References

Liu, Y.J., 2017. National security strategy and its improvement. Expand. Horiz. 4, 5–10.
National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC) [Internet]. China internet network security monitoring data analysis report in the first half of 2020. https://www.cert.org.cn/publish/main/upload/File/2020Report(2).pdf, 2020 (accessed 15 March 2021).
Wu, Z.H., 2019. Information Security Technology and Practice, 5th ed. Liaoning Science and Technology Publishing House, Shenyang.
Xin, Y., Kong, L.S., Liu, Z., Chen, Y.L., Li, Y.M., Zhu, H.L., et al., 2018. Machine learning and deep learning methods for cybersecurity. IEEE Access 6, 35365–35381. doi:10.1109/ACCESS.2018.2836950.


Gumusbas, D., Yildirim, T., 2022. AI for Cybersecurity: ML-Based Techniques for Intrusion Detection Systems. Advances in Machine Learning/Deep Learning-based Technologies, pp. 117–140. doi:10.1007/978-3-030-76794-5_7.
Handa, A., Sharma, A., Shukla, S.K., 2019. Machine learning in cybersecurity: a review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9 (4), e1306. doi:10.1002/widm.1306.
Kolandaisamy, R., Noor, R.M., Kolandaisamy, I., et al., 2020. A stream position performance analysis model based on DDoS attack detection for cluster-based routing in VANET. J. Ambient Intell. Humaniz. Comput. 6, 1–14.
Williamson, M.M., 2003. Resilient infrastructure for network security. Wiley Subscr. Serv. Inc. A Wiley Co. 9 (2), 34–40.
Anderson, J.P., 1980. Computer security threat monitoring and surveillance.
Lunt, T.F., Jagannathan, R., Lee, R., Listgarten, S., Edwards, D.L., Neumann, P.G., et al., 1988. IDES: the enhanced prototype, a real-time intrusion-detection expert system. doi:10.13140/RG.2.1.3905.1685.
Denning, D.E., 1987. An intrusion-detection model. IEEE Trans. Softw. Eng. 13 (2), 222–232.
Kahn, C., Porras, P.A., Chen, S.S., Tung, B., 1998. A common intrusion detection framework. Position Paper of Information Survivability Workshop.
Otoum, Y., Nayak, A., 2021. AS-IDS: anomaly and signature based IDS for the internet of things. J. Netw. Syst. Manag. 29 (23). doi:10.1007/s10922-021-09589-6.
Baig, Z., Salah, K., 2016. Distributed hierarchical pattern-matching for network intrusion detection. J. Internet Technol. 17 (2), 167–178. doi:10.6138/JIT.2016.17.2.20131021.
Benferhat, S., Boudjelida, A., Tabia, K., Drias, H., 2013. An intrusion detection and alert correlation approach based on revising probabilistic classifiers using expert knowledge. Appl. Intell. 38 (4), 520–540. doi:10.1007/s10489-012-0383-7.
Ilgun, K., Kemmerer, R.A., 1995. State transition analysis: a rule-based intrusion detection approach. IEEE Trans. Softw. Eng. 21 (3), 181–199.
Yin, L.B., 2019. National industrial information security development research center. Decoding: Industrial Cyber Security, 1st ed. Publishing House of electronics industry, Beijing.
Jiang, J.C., Ma, H.T., Ren, D.E., Qing, S.H., 2000. A survey of intrusion detection research on network security. J. Softw. 11, 1460–1466.
Luca, B., Marco, C., Mario, M., Enrico, M., Talha, N., Sandro, Z., 2017. Statistical fingerprint-based intrusion detection system (SF-IDS). Int. J. Commun. Syst. 30 (10), 1–11.
Nassif, A.B., Talib, M.A., Nasir, Q., Dakalbab, F.M., 2021. Machine learning for anomaly detection: a systematic review. IEEE Access 9, 78658–78700.
Sun, R., Zhang, S., Yin, C., Wang, J., Minet, S., 2019. Strategies for data stream mining method applied in anomaly detection. Clust. Comput. 22, 399–408.
Cañete-Sifuentes, L., Monroy, R., Medina-Pérez, M.A., 2021. A review and experimental comparison of multivariate decision trees. IEEE Access 9, 110451–110479.
Hota, H.S., Shrivas, A.K., 2014. Decision tree techniques applied on NSL-KDD data and its comparison with various feature selection techniques. Adv. Comput. Netw. Inform. 1, 205–211.
Bagyalakshmi, C., Samundeeswari, E.S., 2020. DDoS attack classification on cloud environment using machine learning techniques with different feature selection methods. Int. J. Adv. Trends Comput. Sci. Eng. 9 (5), 7301–7308.
Umak, M.R., Mishra, R., 2014. An efficient modular approach of intrusion detection system based on MSPSO-DT. Int. J. Adv. Res. Comput. Sci. 5 (3), 47–53.
Mahbooba, B., Timilsina, M., Sahal, R., Serrano, M., 2021. Explainable artificial intelligence (XAI) to enhance trust management in intrusion detection systems using decision Tree model. Complex. 2021, 1–11.
Hearst, M.A., Dumais, S.T., Osman, E., Platt, J., Scholkopf, B., 1998. Support vector machines. IEEE Intell. Syst. Their Appl. 13 (4), 18–28.
Pan, Y.Q., Zhai, W.P., Gao, W., Shen, X.J., 2020. If-SVM: iterative factoring support vector machine. Multimed. Tools Appl. 79 (35-36), 25441–25461.
Chauhan, V.K., Dahiya, K., Sharma, A., 2019. Problem formulations and solvers in linear SVM: a review. Artif. Intell. Rev. 52 (2), 803–855.
Meddeb, R., Jemili, F., Triki, B., Korbaa, O., 2019. Anomaly-based behavioral detection in mobile Ad-Hoc networks. Procedia Comput. Sci. 159, 77–86.
Al-Qatf, M., Yu, L., Al-Habib, M., Al-Sabahi, K., 2018. Deep learning approach combining sparse autoencoder with SVM for network intrusion detection. IEEE Access 6, 52843–52856.
Shen, K., Parvin, H., Qasem, S.N., Tuan, B.A., Pho, K.H., 2020. A classification model based on SVM and fuzzy rough set for network intrusion detection. J. Intell. Fuzzy Syst. 39 (1), 1–17.
Wang, S., Jin, Z.G., 2020. IDS classification algorithm based on fuzzy SVM model. Appl. Res. Comput. 37 (02), 187–190.
Feng, W.Y., Zhang, Q.L., Hu, G.Z., Huang, J.X.J., 2014. Mining network data for intrusion detection through combining SVMs with ant colony networks. Futur. Gener. Comput. Syst. 37, 127–140 Int. J. Escience.
Kabir, E., Hu, J.K., Wang, H., Zhuo, G.P., 2018. A novel statistical technique for intrusion detection systems. Futur. Gener. Comput. Syst. 79, 303–318 The International Journal of Escience.
Pozi, M.S.M., Sulaiman, M.N., Mustapha, N., Perumal, T., 2015. Improving anomalous rare attack detection rate for intrusion detection system using support vector machine and genetic programming. Neural Process. Lett. 44 (2), 1–12.
Zhu, J., Hu, W.B., 2015. Recent advances in Bayesian machine learning. J. Comput. Res. Dev. 52 (01), 16–26.
Koch, I., Naito, K., Tanaka, H., 2019. Kernel naive Bayes discrimination for high-dimensional pattern recognition. Aust. N. Z. J. Stat. 61 (4), 401–428.
Zhang, H.P., Cheng, N., Zhang, Y., Li, Z.B., 2021. Label flipping attacks against Naive Bayes on spam filtering systems. Appl. Intell. 51 (7), 4503–4514.
Yao, W., Wang, J., Zhang, S.L., 2015. Intrusion detection model based on decision tree and Naïve-Bayes classification. J. Comput. Appl. 35 (10), 2883–2885.
Wang, H., Chen, H.Y., Liu, S.F., 2014. Intrusion detection system based on improved Naïve Bayesian algorithm. Comput. Sci. 41 (04), 111–115+119.
Zhang, B., Liu, Z.Y., Jia, Y.G., Ren, J.D., Zhao, X.L., 2018. Network intrusion detection method based on PCA and Bayes algorithm. Secur. Commun. Netw. 2018 (10), 1–11.
Gu, J., Lu, S., 2021. An effective intrusion detection approach using SVM with naïve Bayes feature embedding. Comput. Secur. 103, 102158. doi:10.1016/j.cose.2020.102158.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Hammad, M., Hewahi, N., Elmedany, W., 2021. TNERF: a novel high accuracy machine learning approach for Intrusion detection systems. IET Inf. Secur. 15 (2), 178–190.
Iwendi, C., Khan, S., Anajemba, J.H., Mittal, M., Alenezi, M., Alazab, M., 2020. The use of ensemble models for multiple class and binary class classification for improving intrusion detection systems. Sensors 20 (9), 2559.
Boahen, E.K., Elvire, B., Wang, C., 2021. Network anomaly detection in a controlled environment based on an enhanced PSOGSARFC. Comput. Secur. 104 (4), 102225.
Nazir, A., Khan, R.A., 2020. A novel combinatorial optimization based feature selection method for network intrusion detection. Comput. Secur. 102, 102164.
Mokhtari, S., Abbaspour, A., Yen, K.K., Sargolzaei, A.A., 2021. Machine learning approach for anomaly detection in industrial control systems based on measurement data. Electronics 10 (4), 407.
Ghaleb, F.A., Saeed, F., Al-Sarem, M., Al-rimy, B.A.S., Boulila, W., Eljialy, A.E.M., et al., 2020. Misbehavior-aware on-demand collaborative intrusion detection system using distributed ensemble learning for VANET. Electronics 9 (9), 1411.
Upadhyay, D., Manero, J., Zaman, M., Sampalli, S., 2021. Gradient boosting feature selection with machine learning classifiers for intrusion detection on power grids. IEEE Trans. Netw. Serv. Manag. 18 (1), 1104–1116.
Mishra, P., Varadharajan, V., Pilli, E., Tupakula, U., 2020. VMGuard: a VMI-based security architecture for intrusion detection in cloud environment. IEEE Trans. Cloud Comput. 8 (3), 957–971.
Morfino, V., Rampone, S., 2020. Towards near-real-time intrusion detection for IoT devices using supervised learning and apache spark. Electronics 9 (3), 444.
Amouri, A., Alaparthy, V.T., Morgera, S.D., 2020. A machine learning based intrusion detection system for mobile internet of things. Sensors 20 (2), 461.
Karthik, M.G., Krishnan, M., 2021. Hybrid random forest and synthetic minority over sampling technique for detecting internet of things attacks. J. Ambient Intell. Humaniz. Comput. doi:10.1007/s12652-021-03082-3.
Karthikraja, C., Senthilkumar, J., Hariharan, R., Devi, G.U., Suresh, Y., Mohanraj, V., 2022. An empirical intrusion detection system based on XGBoost and bidirectional long-short term model for 5G and other telecommunication technologies. Comput. Intell. doi:10.1111/coin.12497.
Zhang, W.G., Zhang, R.H., Wu, C.Z., Goh, A.T.C., Lacasse, S., Liu, Z.Q., et al., 2020. State-of-the-art review of soft computing applications in underground excavations. Geosci. Front. 11 (04), 1095–1106.
Bhattacharya, S., Krishnan, S.S.R., Maddikunta, P.K.R., Kaluri, R., Singh, S., Gadekallu, T.R., et al., 2020. A novel PCA-firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics 9 (2), 219.
Wang, X.L., Lu, X., 2020. A host-based anomaly detection framework using XGBoost and LSTM for IoT devices. Wirel. Commun. Mob. Comput. 2020, 1–13.
Bedi, P., Gupta, N., Jindal, V., 2021. I-SiamIDS: an improved Siam-IDS for handling class imbalance in network-based intrusion detection systems. Appl. Intell. 51 (2), 1133–1151.
Qiao, N., Li, Z.X., Zhao, G.S., 2022. Intrusion detection model of internet of things based on XGBoost-RF. J. Chin. Mini Micro Comput. Syst. 43 (01), 152–158.
Kumar, P., Gupta, G.P., Tripathi, R., 2021. An ensemble learning and fog-cloud architecture-driven cyber-attack detection framework for IoMT networks. Comput. Commun. 166, 110–124.
Xu, M.F., Li, X.H., Wang, Y.W., Luo, B., Guo, J.J., 2020. Privacy-preserving multi-source transfer learning in intrusion detection system. Trans. Emerg. Telecommun. Technol. 32 (5), e3957.
Abu Al-Haija, Q., Zein-Sabatto, S., 2020. An efficient deep-learning-based detection and classification system for cyber-attacks in IoT communication networks. Electronics 9 (12), 2152.
Zhou, X.K., Liang, W., Shimizu, S., Ma, J.H., Jin, Q., 2021. Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems. IEEE Trans. Ind. Inf. 17 (8), 5790–5798.
van Wyk, F., Wang, Y.Y., Khojandi, A., Masoud, N., 2020. Real-time sensor anomaly detection and identification in automated vehicles. IEEE Trans. Intell. Transp. Syst. 21 (3), 1264–1276.
Nie, L.S., Ning, Z.L., Wang, X.J., Hu, X.P., Cheng, J., Li, Y.K., 2020. Data-driven intrusion detection for intelligent internet of vehicles: a deep convolutional neural network-based method. IEEE Trans. Netw. Sci. Eng. 7 (4), 2219–2230.
Jeong, S., Jeon, B., Chung, B., Kim, H.K., 2021. Convolutional neural network-based intrusion detection system for AVTP streams in automotive Ethernet-based networks. Veh. Commun. 29, 100338.
Khan, M.A., 2021. HCRNNIDS: hybrid convolutional recurrent neural network-based network intrusion detection system. Processes 9 (5), 834.
Riyaz, B., Ganapathy, S., 2020. A deep learning approach for effective intrusion detection in wireless networks using CNN. Soft Comput. 24 (22), 17265–17278.


Andresini, G., Appice, A., Malerba, D., 2021. Nearest cluster-based intrusion detection through convolutional neural networks. Knowl. Based Syst. 216, 106798.
Li, Y.M., Xu, Y.Y., Liu, Z., Hou, H.X., Zheng, Y.S., Xin, Y., et al., 2020. Robust detection for network intrusion of industrial IoT based on multi-CNN fusion. Measurement 154, 107450.
He, M., Wang, X., Zhou, J., Xi, Y.Y., Jin, L., Wang, X.L., 2021. Deep-feature-based autoencoder network for few-shot malicious traffic detection. Secur. Commun. Netw. 2021 (6), 1–13.
Khan, S.H., Hayat, M., Bennamoun, M., Sohel, F.A., Togneri, R., 2018. Cost sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 29 (8), 3573–3587.
Zhang, H.P., Huang, L.L., Wu, C.Q., Li, Z.B., 2020. An effective convolutional neural network based on SMOTE and Gaussian mixture model for intrusion detection in imbalanced dataset. Comput. Netw. 177, 107315.
Mulyanto, M., Faisal, M., Prakosa, S.W., Leu, J.S., 2021. Effectiveness of focal loss for minority classification in network intrusion detection systems. Symmetry 13 (1), 4.
M, S.K., Ben-Othman, J., Srinivasagan, K.G., Umarani, P., 2022. Machine learning methods for enhanced cyber security intrusion detection system. Adv. Comput. Inform. Netw. Cybersecur. 733–754.
Li, Z.Z., Hoiem, D., 2018. Learning without Forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40 (12), 2935–2947.
Xiao, K., Liu, T.Y., Sun, X.Y., He, Y.H., Zeng, F.F., 2020. Intrusion detection method based on incremental convolution neural network. J. Comput. Appl. 40 (S2), 73–79.
Alavizadeh, H., Alavizadeh, H., Jang-Jaccard, J., 2022. Deep Q-learning based reinforcement learning approach for network intrusion detection. Computers 11 (3), 41.
Kwon, S., Yoo, H., Shon, T., 2020. IEEE 1815.1-based power system security with bidirectional RNN-based network anomalous attack detection for cyber-physical system. IEEE Access 8, 77572–77586.
Oliveira, N., Praça, I., Maia, E., Sousa, O., 2021. Intelligent cyber attack detection and classification for network-based intrusion detection systems. Appl. Sci. 11 (1674), 1674.
Xie, J., Li, S.H., Yun, X.C., Zhang, Y.Z., Chang, P., 2020. HSTF-Model: an HTTP-based Trojan detection model via the hierarchical Spatio-temporal features of traffics. Comput. Secur. 96, 101923.
Hsu, C.M., Azhari, M.Z., Hsieh, H.Y., Prakosa, S.W., Leu, J.S., 2021. Robust network intrusion detection scheme using long-short term memory based convolutional neural networks. Mob. Netw. Appl. 26 (3), 1137–1144.
Yao, R.Z., Wang, N., Liu, Z.H., Chen, P., Sheng, X.J., 2021. Intrusion detection system in the advanced metering infrastructure: a cross-layer feature-fusion CNN-LSTM-based approach. Sensors 21 (2), 626.
Sun, P.F., Liu, P.J., Li, Q., Liu, C.X., Lu, X.L., Hao, R.C., et al., 2020. DL-IDS: extracting features using CNN-LSTM hybrid network for intrusion detection system. Secur. Commun. Netw. 2020, 8890306.
KDD Cup 1999 Data. October 28, 1999. Available from: http://kdd.ics.uci.edu.
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analysis of the KDD CUP 99 data set. In: Proceedings of the IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 1–6. doi:10.1109/CISDA.2009.5356528.

Dr. Chunying Zhang received the BS from the School of Computer Science and Technology, Heilongjiang University, and the MS and Ph.D. in Computer Application Technology from the School of Information Science and Engineering, Yanshan University. She is currently a professor at the College of Science, North China University of Science and Technology. She is a member of China Computer Federation (CCF). Her research interests include data mining, rough set and social network analysis.

MS. Donghao Jia is a graduate student at the College of Science, North China University of Science and Technology. His main research interests are machine learning, artificial intelligence, and network intrusion detection.

MS. Liya Wang was a Teaching Assistant with the College of Science, North China University of Science and Technology, where she has been a Lecturer since 2018. She is the author of two books and more than 30 articles. Her research interests include concept lattice, social networks, data mining, network intrusion detection, and intelligence algorithm.

MS. Wenjie Wang is a graduate student at the College of Science, North China University of Science and Technology. Her main research interests are machine learning, artificial intelligence, and network intrusion detection.

MS. Fengchun Liu received the BS from the School of Information Science and Technology, Liaoning University, and the MS from the Institute of Chemical Industry, Hebei Polytechnic University. He is currently an associate professor in the College of Qianan at North China University of Science and Technology. His research interests include data mining and machine learning.

Dr. Aimin Yang is currently a professor at the College of Science, North China University of Science and Technology. His research interests include intelligent algorithm design, steel big data and metallurgical intelligent manufacturing, medical big data and AI-assisted diagnosis and treatment system development, and ship multi-attribute intelligent decision-making.
