Comparative Research On Network Intrusion Detection Methods Based
Article info
Article history: Received 23 February 2022; Revised 2 July 2022; Accepted 24 July 2022; Available online 28 July 2022
Keywords: Network intrusion detection; Machine learning; Deep learning; Comparative experiment

Abstract
The network intrusion detection system is an essential part of network security research. It detects intrusion behaviors through active defense technology and takes emergency measures such as alerting and terminating intrusions. With the rapid development of machine learning technology, more and more researchers apply machine learning algorithms to network intrusion detection to improve detection efficiency and accuracy. Because the algorithms rest on different principles, each has its own advantages and disadvantages. To construct the dominant algorithm model in the field of network intrusion detection and provide the accuracy value, this paper systematically reviews the literature of the past ten years on the application of machine learning algorithms in intrusion detection. The review covers three categories: traditional machine learning, ensemble learning, and deep learning. Then, this paper selects the KDD CUP99 and NSL-KDD datasets to conduct comparative experiments on decision trees, Naive Bayes, support vector machines, random forests, XGBoost, convolutional neural networks, and recurrent neural networks. The detection accuracy, F1, AUC, and other indicators of these algorithms on the different data sets are compared. The experimental results show that the ensemble learning algorithms generally perform better. The Naive Bayes algorithm has low accuracy in recognizing the learned data, but it has obvious advantages when facing new types of attacks, and its training speed is faster. The deep learning algorithms are not particularly prominent in this experiment, but their optimal results are affected by the structure, hyperparameters, and the number of training iterations, which need further in-depth study. Finally, the main challenges facing the current network intrusion detection field are summarized, and future research directions are outlined.
© 2022 Elsevier Ltd. All rights reserved.
∗ Corresponding author.
E-mail address: [email protected] (L. Wang).
https://doi.org/10.1016/j.cose.2022.102861
0167-4048/© 2022 Elsevier Ltd. All rights reserved.

1. Introduction
The wide application of information technology and the rise and development of cyberspace have promoted the prosperity and progress of the economy and society, but they have also brought new security risks and challenges. Cyberspace security concerns the common interests of humanity, world peace and development, and national security of all countries (Liu, 2017). The advent of the digital age means that everything will be interconnected, and the entire world will be built on software. Where there is Internet software, there will be loopholes, network infrastructure will become more complex, and the attack surface will expand infinitely. Network security attacks will show advanced and large-scale characteristics. All financial and technological wars are, in the final analysis, network wars. China Internet network security monitoring data analysis report shows that in the first half of 2020, there were about 19,000 counterfeit pages on domestic websites, about 18,000 overseas IP addresses implanted backdoors on about 35,900 websites in China, and about 74,000 websites have been tampered with within China, including 318 government websites that have been tampered with. The number of DDoS attacks on the cloud platform accounted for 76.1% of the domestic targets attacked. The number of implanted backdoor links accounted for 90.3% of all domestically planted backdoor links. The number of industrial equipment exposed on the Internet reached 4630, and it continued to be scanned and sniffed more than 20,000 times a day from abroad (National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC) 2020).
It can be seen that cyberspace security is facing a considerable threat, and it is necessary to increase efforts to improve technology for efficient defense. There are two main types of network security technologies. One is passive prevention technology such as data encryption, identity authentication, firewall, etc. The other is active prevention technology, mainly network intrusion detection,
C. Zhang, D. Jia, L. Wang et al. Computers & Security 121 (2022) 102861
which can detect some abnormal network behaviors and deal with them accordingly. Active defense technology based on network intrusion detection has become an important research topic in network security.
The network intrusion detection system is divided into two modes according to the behavior of intrusion detection: anomaly detection and signature detection. Anomaly detection needs to establish a model of normal behavior and judges behaviors that do not conform to this model as intrusions. The core of this detection method lies in how to define the so-called "normal" situation. In contrast, signature detection needs to summarize all possible unfavorable and unacceptable behaviors to establish a model. Behavior that conforms to this model is judged as an intrusion. This method mainly evaluates whether the event characteristics that violate the security policy are present in the collected data. The core technology is to maintain a knowledge base (Wu, 2019).
Machine learning is a science of artificial intelligence, which mainly studies how to use data or previous experience to achieve algorithm classification and prediction (Xin et al., 2018). Its theories and methods have been widely used to solve complex problems in engineering applications and scientific fields, such as natural language understanding, non-monotonic reasoning, machine vision, pattern recognition, text understanding, sentiment analysis, image retrieval and understanding, and analysis of graph and network data. With the development of machine learning technology and the increasing complexity of network intrusion detection problems, many researchers have applied machine learning algorithms to network intrusion detection systems and achieved good detection results (Gumusbas and Cybersecurity, 2022). The method based on a machine learning algorithm is mainly used in anomaly-based intrusion detection systems, and its basic process is shown in Fig. 1. The most important part of the process is to define the analysis goals, because the data and models required to study different intrusion detection problems will differ.
Many researchers have improved the machine learning algorithm and introduced the improved method into the network intrusion detection system, and proposed many practical defense technology algorithms, models, and systems (Handa et al., 2019). This paper selects the algorithms of machine learning in network intrusion detection systems in the past ten years for overview and analysis and compares these algorithms through experiments. The main contributions of this paper are divided into the following points.
1. This review introduces the application of common machine learning algorithms in intrusion detection in the past ten years. It provides solutions and ideas for subsequent research on problems such as data imbalance and incremental learning in intrusion detection.
2. This review conducts comparative experiments on several types of machine learning algorithms in intrusion detection and provides data support for constructing the dominant algorithm model of intrusion detection.
3. This review discusses the challenges facing the intrusion detection field and directions for future development.

2. Intrusion detection system and application
Intrusion Detection System (IDS) refers to a series of devices or software that play an essential role in combating intrusions and malicious behavior in modern organizations. Its function is not to eliminate but to guard against network attacks. It determines whether anomalies occur by inspecting traffic or logs. If there is any abnormality, it will send an alarm to the system's management unit. The intrusion detection system is a proactive security defense technology for identifying intrusion behavior. If an intrusion behavior has occurred, it will actively respond, take measures to prevent it, and collect intrusion evidence (Kolandaisamy et al., 2020). It is essential to use and improve the intrusion detection system. There is no safe system in the world. Target systems are often attacked by two types of users (Williamson, 2003). One is a legitimate user who is part of the system but exceeds the trust range specified by the system. The other is an illegal user who is not identified by the system but operates or attacks the target system.

2.1. Common intrusion detection framework
In April 1980, Anderson (Anderson, 1980) first proposed the concept of intrusion detection. After that, Lunt et al. (1988) proposed an expert system for intrusion detection. In 1986,
Denning (1987) proposed an intrusion detection model consisting of six parts: objects, subjects, abnormal records, activity rules, audit records, and behavior profiles, which received extensive attention. Kahn et al. (1998) proposed a Common Intrusion Detection Framework (CIDF) after integrating various intrusion detection models. It has a simple structure and is divided into four parts: event generator, event analyzer, response unit, and event database, as shown in Fig. 2.

2.2. Intrusion detection system classification based on detection method
The intrusion detection system can be divided into host-based and network-based detection according to different data sources. According to the different detection methods, it can be divided into signature detection and anomaly detection (Otoum and Nayak, 2021). A variety of different algorithms can be used under different detection methods, as shown in Fig. 3.

2.2.1. Signature detection
Signature detection is an intrusion detection method based on pattern matching. It identifies and characterizes the intrusion behavior and then constructs a database. When the data in the network activity information matches the characteristics contained in the database, the intrusion detection system will issue a warning. The basic model is shown in Fig. 4. There are many signature-based intrusion detection methods, including the pattern matching method (Baig and Salah, 2016), the expert system method (Benferhat et al., 2013), the state transition analysis method (Ilgun and Kemmerer, 1995), and so on. The current technology of rule-based intrusion detection is relatively mature, and many mature commercial products have appeared. Intrusion detection systems such as ModSecurity have been widely used.
The main advantages of signature intrusion detection technology are: (1) only relevant data need to be collected, so the system burden is small; (2) similar to a virus detection system, the accuracy and efficiency are relatively high; (3) the technology is relatively mature.
Table 1
Summary of intrusion detection based on traditional machine learning.
Columns: Reference | Work | Result (rows grouped by algorithm)

Decision Tree
Hota and Shrivas (2014) | A comprehensive and comparative analysis of the decision trees. | Using the NSL-KDD dataset, the C4.5 algorithm works best, and when using the Info Gain feature selection technology, its accuracy reaches 99.68%.
Bagyalakshmi and Samundeeswari (2020) | DDoS attack classification on cloud environment using machine learning techniques with different feature selection methods. | The decision tree technology based on learning vector quantization has the best classification effect, with an accuracy of 98.74%.
Umak and Mishra (2014) | An efficient modular approach of intrusion detection system based on MSPSO-DT. | The MSPSO-DT-based system improves intrusion detection accuracy to more than 99.43%; the speedup testing time is 11.13 sec.
Mahbooba et al. (2021) | The importance of features based on entropy measure for intrusion detection is studied based on decision trees. | The concept of explainable artificial intelligence is proposed, and the intrusion classification rules extracted from the decision tree method are explained, which provides a reference for subsequent research on the interpretation of more complex deep learning models.

SVM
Meddeb et al. (2019) | A set of IDS nodes is proposed to supervise and analyze the behavior of the observed system, and collect and construct training data. | On the constructed new dataset, the SVM-based intrusion detection model achieves high-precision detection for ad hoc networks.
Al-Qatf et al. (2018) | Using the SAE algorithm to reconstruct the dataset. | Accuracy reaches 95.09% and 84.96% on KDD99 and NSL-KDD datasets.
Shen et al. (2020) | Feature extraction from data using fuzzy rough set theory. | Realizes high-precision and fast identification, reduces system burden, and achieves 99.84% accuracy on the NSL-KDD data set.
Wang and Jin (2020) | SVM is used for initial classification, and then a density-based clustering algorithm is used for sub-classification. | For small-scale data, the two-level intrusion detection algorithm combining fuzzy support vector machine and DBSCAN achieves high classification accuracy while maintaining fast training speed.
Feng et al. (2014) | A new algorithm CSVAC is proposed, which combines the modified SVM and CSOACN. | Experiments are performed using the KDD99 dataset, and the model implements incremental learning.
Kabir et al. (2018) | An optimal allocation-based least squares support vector machine (OA-LS-SVM) intrusion detection model. | The model achieves incremental learning without reducing the generalization ability of LS-SVM, while reducing training time.
Pozi et al. (2015) | An intrusion detection model integrating genetic programming and support vector machine algorithm. | The model has good generalization and can be used to discover new types of network attacks.

Naïve Bayes
Yao et al. (2015) | Intrusion detection model based on decision tree and Naïve Bayes classification. | Alleviates the constraints of Naive Bayes conditional independence and avoids the problem of overfitting of decision trees.
Wang et al. (2014) | Improved Naive Bayes algorithm. | The TrS and TeS of the KDD99 intrusion detection data set are used for simulation experiments, and the accuracy is 98.21%.
Zhang et al. (2018) | The KDD99 dataset was preprocessed using IPCA and detected using the Gaussian Naive Bayes algorithm. | Reduces model training and testing time.
Gu and Lu (2021) | A Naive Bayes algorithm using feature embedding ideas. | Converts low-quality data into high-quality data with obvious feature categories.
gorithms to improve the detection rate, so there is little room for development in the future.

2.3.2. Support vector machine learning model
Support Vector Machine (SVM) is a binary classification algorithm developed from the generalized portrait algorithm (Hearst et al., 1998). It took nearly 40 years from its emergence to the more mature modern SVM concept proposed by Vapnik et al. SVM has not only a good classification effect but also good portability, stability, and robustness (Pan et al., 2020). It is a binary linear classifier. Without improvement, its ability to deal with large-scale data and multi-classification problems is weak (Chauhan et al., 2019), so improving the SVM algorithm has become the focus of research.
(1) Research on SVM intrusion detection based on data optimization.
Although the detection effect of SVM is excellent, in the case of a large amount of data and high dimension, its training time is longer than that of other algorithms. Therefore, many scholars have optimized the dataset to make better use of SVM classification. Meddeb et al. (2019) collected and built a database NetBigData (DoS attack) for in-vehicle ad hoc networks and achieved high-precision intrusion detection for ad hoc networks. Al-Qatf et al. (2018) proposed the SAE-SVM intrusion detection model, which used the SAE algorithm to reconstruct the data set so that the new data set created could have good characteristics. Shen et al. (2020) proposed the FRST-SVM model, which used fuzzy rough set theory to extract features from data. The SAE-SVM and FRST-SVM models achieve high-precision and rapid recognition, reducing the system burden.
SVM can not only detect intrusion data but also be applied to intrusion detection as a data processing algorithm. The intrusion detection model proposed by Wang and Jin (2020) uses SVM to initially classify the data set and then uses the classification result as the input of the density-based clustering algorithm for sub-classification. Although the experimental results are good, the model is complex and was proposed based on a small dataset. When the amount of data is enormous, it will expose the drawback of the long training time of SVM, so it is not universal.
(2) Research on incremental SVM intrusion detection.
The principle of the SVM algorithm classification is to find the optimal hyperplane and classify the data according to the optimal
hyperplane. The construction of the optimal hyperplane does not require all support vectors and can be drawn only according to the support vectors of the edges. When new data is judged as an edge vector, it can be classified to achieve incremental learning.
To better detect dynamic data, many scholars have improved the traditional SVM algorithm. For example, the CSVAC model (Feng et al., 2014) incorporates a clustering based on a self-organizing ant colony training network (CSOACN) algorithm based on active support vector machines. This model has the advantages of SVM and CSOACN at the same time. SVM has a better classification effect. CSOACN can generate an adaptive model to train new category data and realize incremental learning based on fast detection.
Kabir et al. (2018) proposed an optimal allocation-based least squares support vector machine (OA-LS-SVM) intrusion detection model for new data detection. The model uses the idea of sampling in statistics and can extract representative samples from the old and new data sets as the input of the detection algorithm, the least squares support vector machine (LS-SVM). The experimental results show that this model does not reduce the generalization ability of LS-SVM while implementing incremental learning. At the same time, the training time is significantly reduced due to the use of sampling ideas.
Pozi et al. (2015) defined the missing and newly added attack data in the training set as unusually rare attacks. To improve the detection rate of such attacks, the authors proposed an intrusion detection model combining genetic programming and an SVM algorithm. Many experiments in the literature show that the detection rate does not decrease significantly when rare abnormal attacks occur. This also shows that the algorithm has good generalization and can be used to discover new types of network attacks.
It can be known from the above literature that the structure of the improved intrusion detection system is complex, and the support vector machine has difficulty dealing with large-scale data. This makes the enhanced system inefficient in solving the intrusion detection problem, and the optimization and improvement of the model still need further research. Although the support vector machine has many problems, it has strong portability and high stability, and it is easy to carry out algorithm improvement research on it. It has always been the mainstream algorithm to solve the problem of intrusion detection. Therefore, learning support vector machines is an essential process for gaining a deep understanding of intrusion detection problems.

2.3.3. Naive Bayes learning model
The Naive Bayes algorithm is derived from Bayes' theorem and has a solid mathematical foundation. It adds to the Bayesian approach the assumption that different attributes are independent of each other. But this assumption does not hold in reality, and it will inevitably affect the results of the Naive Bayes classification (Zhu and Hu, 2015). However, compared to the higher complexity of the Bayesian algorithm, the impact of Naive Bayes on the results is acceptable. The principle of the Naive Bayes classification algorithm is simply to calculate the posterior probability according to the prior probability and then judge which category a sample belongs to according to the posterior probability. Today, Naive Bayes algorithms have been applied in various fields, such as pattern recognition (Koch et al., 2019), spam classification (Zhang et al., 2021), intrusion detection (Yao et al., 2015), etc.
(1) Research on intrusion detection based on Naive Bayes classification.
Yao et al. (2015) proposed an intrusion detection model based on decision trees and Naive Bayes. This model is an ensemble algorithm. After the data is processed, the decision tree classifier and the Naive Bayes classifier are used for detection, respectively. Then the two results are evaluated to obtain the final result. The advantage of this model is that it relieves the constraints of Naive Bayes conditional independence and avoids the problem of overfitting the decision tree.
Wang et al. (2014) proposed an improved Naive Bayes algorithm for the problem of accuracy degradation caused by different feature attribute sample nodes. This algorithm combines the original model with a classification control parameter, and the optimal parameters are adjusted and selected to improve detection accuracy. The experiment uses the training set and test set provided by KDD99 to achieve multi-class detection, and the detection results are above 97%. It can be seen that the model has strong generalization while improving the accuracy.
Compared with other classification algorithms, the detection accuracy of the Naive Bayes algorithm is not optimal, but it can achieve fast classification. Zhang et al. (2018) took full advantage of the fast classification speed of the Naive Bayes algorithm. At the same time, to further shorten the training time and improve the accuracy, the authors first used IPCA to preprocess the KDD99 data set and then used the Gaussian Naive Bayes algorithm for detection. From the experimental results, when the detection accuracy is 91%, it only takes 0.562 s to complete the training model and test data. This result is of great significance for the real-time monitoring research of intrusion detection.
(2) Research on intrusion detection based on Naive Bayes feature extraction.
The Naive Bayes algorithm can not only be used as a classifier in intrusion detection research but can also be used to process intrusion data. Gu and Lu (2021) proposed a Naive Bayes algorithm utilizing the idea of feature embedding to deal with the problem of similar characteristics of normal and abnormal data and convert low-quality data into high-quality data with prominent feature categories that are easy to classify. This makes detection easier for the classifier.
The research on traditional machine learning algorithms such as Naive Bayes and decision trees in intrusion detection entered a mature stage very early. In recent years, these algorithms have often appeared in the literature as part of data processing.

2.4. Application of ensemble learning in the field of intrusion detection
In machine learning, different algorithms have different advantages and disadvantages. To avoid the algorithm's preference when dealing with different data sets, different weak learners can be combined to form a strong learner, that is, an ensemble learning algorithm. There are many kinds of ensemble learning models. This paper mainly introduces the application of random forest, XGBoost, and their improved algorithms in intrusion detection, which is summarized in Table 2.

2.4.1. Random forest
In 2001, Breiman (2001) proposed a machine learning algorithm, Random Forest, that fuses Bagging ensemble learning theory and the random subspace method. Multiple decision trees are trained by Bagging ensemble learning technology and combined into random forests. When dealing with data classification problems, the classification results are determined by the mode of the results obtained by all decision trees.
(1) Based on the application of improved random forest in the field of intrusion detection.
The classification model (T-SNERF) proposed by Hammad et al. (2021), combined with t-distributed stochastic neighbor embedding and random forest algorithm, achieves 100% recognition accuracy on the UNSW-NB15 intrusion detec-
Table 2
Summary of intrusion detection based on ensemble learning.
Columns: Reference | Work | Result (rows grouped by algorithm)

Random Forest
Hammad et al. (2021) | A T-SNERF intrusion detection model. | Accuracy of 100%, 99.7878%, and 99.7044% on UNSW-NB15, CICIDS-2017, and Phishing datasets.
Iwendi et al. (2020) | Classification method using CFS + ensemble algorithm. | The accuracy rate of multiple groups of experiments is above 99%.
Boahen et al. (2021) | Network anomaly detection in a controlled environment based on an enhanced PSOGSARFC. | Accuracy rates of 98.96% and 98.56% on UNSW-NB15 and NSL-KDD datasets.
Morfino and Rampone (2020) | Compares multiple machine learning algorithms used in IoT network intrusion detection systems. | Random forest achieves 100% accuracy on the SYN-DOS attack dataset.
Amouri et al. (2020) | Process the data collected by the sniffer from the MAC layer and the network layer using the random forest algorithm. | Realize network intrusion detection based on the Internet of Things.
Karthik and Krishnan (2021) | Hybrid random forest and synthetic minority over-sampling technique for detecting Internet of Things attacks. | Through comparative experiments, it is shown that the RF-SMOTE model performs well in the field of IoT network intrusion detection.

XGBoost
Bhattacharya et al. (2020) | A Novel PCA-Firefly based XGBoost classification model for intrusion detection in networks. | Through multiple sets of comparative experiments, it is proved that the XGBoost-PCA and Firefly model performs well.
Wang and Lu (2020) | Intrusion detection model based on the coupling of XGBoost and LSTM model. | XGBoost-LSTM models outperform XGBoost or LSTM models alone.
Bedi et al. (2021) | An improved Siam-IDS for handling class imbalance in Network-based Intrusion Detection Systems. | The model can deal with the class imbalance problem and comprehensively obtain the attack types in the intrusion detection data set.
Qiao et al. (2022) | An XGBoost-RF intrusion detection model. | The model has obvious advantages in solving the problem of imbalanced data.
Kumar et al. (2021) | An ensemble learning and fog-cloud architecture-driven cyber-attack detection framework for IoMT networks. | The model implements fog computing-based intrusion detection, with a detection rate of 99.98% on the ToN-IoT dataset.
Xu et al. (2020) | An XGBoost Intrusion Detection Scheme Based on Privacy Protection. | On the premise that the detection results are not greatly reduced, privacy protection can be achieved.
tion dataset. It also achieves 99.7878% and 99.7044% on the CICIDS-2017 and Phishing datasets, respectively.
Iwendi et al. (2020) proposed to use CFS to reduce the dimensionality of the dataset and then experimented with Bagging Random Forest and Adaboost Random Forest. The results show that both models achieve over 99% accuracy on KDD99 and NSL-KDD datasets.
Boahen et al. (2021) proposed the PSOGSARF intrusion detection model. The dataset was processed using PSOGSA and detected using the random forest algorithm.
These experimental results demonstrate that the improved accuracy and reduced training time benefit from extensive dataset processing before classification. Random forest is an ensemble algorithm based on decision trees. In the face of unseen data, random forest classifiers generally outperform decision trees (Nazir and Khan, 2020). Apart from parameter changes, few scholars have improved the structure of random forests in recent years, and most improvements have been achieved through algorithm fusion. At the same time, because of its excellent classification ability, random forest has been widely used in various fields of intrusion detection research.
(2) Application of random forest in the field of IoT intrusion detection.
In recent years, with the development of 5G technology, many devices can be controlled through wireless networks, realizing the interconnection of everything, which is the origin of the concept of the Internet of Things. With its continuous development, the Internet of Things has now grown in many fields, such as industrial control systems (Mokhtari et al., 2021), uncrewed vehicles (Ghaleb et al., 2020), smart grids (Upadhyay et al., 2021), cloud security (Mishra et al., 2020), etc. As a new network paradigm, the distributed nature of the Internet of Things threatens its security significantly.
Because of its complexity, randomness, accuracy, and other characteristics, random forest is suitable as a core algorithm for the Internet of Things. Morfino and Rampone (2020) compared several machine learning algorithms used in the Internet of Things network intrusion detection system, using the intrusion data targeted at the Internet of Things, namely the SYN-DOS attack. All supervised machine learning algorithms used in the experiments are in Apache Spark's MLlib library, and experiments are performed in a cloud environment. The experimental results show that the random forest has the best effect, the accuracy rate reaches 100%, and the time required for training and testing in the cloud environment is significantly reduced.
The Internet of Things is a newer network paradigm. The operating systems used by many Internet of Things devices are relatively lightweight, which makes it impossible to install and use large and complex antivirus software. Therefore, it is necessary to develop an intrusion detection system that can be applied to the Internet of Things. Amouri et al. (2020) proposed a cross-layer IoT network intrusion detection method, using a random forest algorithm to process the data collected by the sniffer from the MAC layer and the network layer and send them centrally to the super node. Karthik and Krishnan (2021) proposed an IoT network attack detection model based on random forest and synthetic minority oversampling technology. The authors used NSL-KDD and N-BaIoT datasets and conducted experiments with SVM, decision tree, and random forest. The results show that the RF-SMOTE model has the best detection effect.

2.4.2. XGBoost
The full name of XGBoost (Karthikraja et al., 2022) is eXtreme Gradient Boosting, an optimized distributed boosting library with high efficiency, flexibility, and convenience, which was summarized and proposed by Chen based on previous research. Machine learning algorithms are implemented under the gradient boosting framework. During gradient boosting, each subsequent tree is constructed sequentially to minimize the error of the previous tree (Zhang et al., 2020).
(1) Intrusion detection system based on improved XGBoost algorithm.
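As background for the improved variants, the sequential construction described for gradient boosting, in which each new tree is fitted to the error left by the trees before it, can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: it uses one-split regression stumps, squared error, and a fixed learning rate on a toy one-dimensional data set. It is not the XGBoost library, which additionally uses regularization and second-order gradient information, and the function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Try every split point that leaves at least one sample on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.5):
    """Build stumps sequentially; each one is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given samples."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, made up for illustration only.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 2.0, 2.1]

if __name__ == "__main__":
    for rounds in (0, 1, 5, 20):
        print(rounds, round(mse(boost(xs, ys, rounds), xs, ys), 4))
```

With squared error and a learning rate between 0 and 1, each added stump cannot increase the training error, which is why the training error shrinks as more rounds are added; controlling the resulting risk of overfitting is one of the reasons regularized implementations are preferred in practice.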
The XGBoost algorithm can limit overfitting and improve both accuracy and efficiency in intrusion detection research. However, with the development of the Internet and the arrival of the era of big data and the Internet of Things, network security faces a growing number of new technical challenges. Researchers have therefore improved on the XGBoost algorithm to deal with various practical problems.
To address the low detection efficiency caused by the vast size of network intrusion detection data sets, Bhattacharya et al. (2020) proposed an XGBoost intrusion detection model based on the PCA-FA algorithm. The data are encoded, normalized, and dimensionally reduced before the XGBoost classification algorithm runs. The experimental results show that the XGBoost model combined with PCA and Firefly outperforms the other algorithms compared.
As mentioned above, the Internet of Things uses lightweight operating systems on devices with limited computing resources. Facing this dilemma, Wang and Lu (2020) proposed a host-based intrusion detection model that couples XGBoost with an LSTM model. The data set is the system call sequence collected from the device. LSTM is used because it is suitable for processing sequence data with long-term dependencies. Experiments show that the XGBoost-LSTM stacking model is superior to either XGBoost or LSTM alone.
Network traffic data are enormous and consist mostly of normal traffic; only a small fraction corresponds to malicious activity. Models trained on such imbalanced data sets tend to produce degraded detection results. To solve this problem, some scholars raised the proportion of minority traffic in the total data. Although the detection results improved, the efficiency of the model did not. Bedi et al. (2021) proposed an intrusion detection system based on a two-layer ensemble model (I-SiamIDS). The first layer integrates b-XGBoost, a Siamese neural network, and a deep neural network to identify attack data. The attack data obtained in the first layer are fed into the m-XGBoost model of the second layer, which finally yields a more comprehensive identification of the attack type.
(2) Intrusion detection system combined with XGBoost and random forest.
Although both XGBoost and random forest belong to ensemble learning, as networks develop, an intrusion detection model built on a single ensemble algorithm has difficulty dealing with complex network attacks. Research has found that fusing or integrating different ensemble learning algorithms into new models often gives better results. The following introduces several intrusion detection models that combine random forests and XGBoost.
Qiao et al. (2022) proposed the XGBoost-RF model to deal with imbalanced data. First, the XGBoost algorithm is used to score the relatively important features in the data set, and then a random forest algorithm with updated weights is used for attack detection. The advantage of this model is that, through feature selection and weight adjustment, a small number of important samples can be fully trained, as shown in Fig. 6. Compared with the two-layer detection model (I-SiamIDS) of Bedi et al. (2021), this model performs better.
Kumar et al. (2021) proposed an intrusion detection model based on fog computing to deal with IoT network security issues. The model structure is divided into two layers. Decision trees, Naive Bayes, and random forests are used as the first-level learners, and XGBoost is then used for centralized classification. The authors conducted a large number of experimental comparisons with other algorithms. The experimental results show that the model is superior to the other algorithms, with the optimal results all above 99%. These examples show that intrusion detection models combining random forests and XGBoost have obvious advantages over other algorithms.
(3) Privacy-preserving intrusion detection system based on XGBoost.
With the expansion of network scale, the types and volume of network intrusion detection data keep growing. Training with a single source of intrusion detection data is prone to overfitting. To this end, an internationally renowned security agency proposed the Managed Detection and Response (MDR) scheme, which trains the detection model by uploading data from multiple sources to the cloud. However, this scheme suffers from problems such as high computational cost and long training time. At the same time, the development of the Internet has brought new challenges to privacy protection. The US PRISM program and revelations that some Chinese companies collected user information through apps and used it for profit have drawn increasing attention to privacy protection.
To solve the privacy problem, Xu et al. (2020) proposed an intrusion detection scheme based on encrypted XGBoost. The final model for detecting target data is obtained by integrating the encrypted detection models uploaded to the cloud. This scheme not only achieves privacy protection but also dramatically reduces the training time, at the cost of a slight decrease in detection results. However, the future will be an era of supercomputing, marked in particular by the acceleration of cloud computing. The intrusion detection scheme based on encrypted XGBoost therefore still needs to improve its detection rate. Otherwise, even though it achieves privacy protection, it will be eliminated once the difference in training time becomes small.

2.5. The application of deep learning in the field of intrusion detection

Deep learning, the latest branch of machine learning, has achieved great success in recent years, especially in speech recognition, image classification, and big data. It solves more practical problems while maintaining high predictive ability and accuracy. This paper mainly summarizes the application of CNN and RNN in intrusion detection, as outlined in Table 3.

2.5.1. Research on intrusion detection based on CNN
In recent years, researchers have applied CNN algorithms to intrusion detection in various fields, such as the Internet of Things (Abu Al-Haija and Zein-Sabatto, 2020), industrial control systems (Zhou et al., 2021), connected autonomous vehicles (van Wyk et al., 2020; Nie et al., 2020), and in-vehicle self-organizing networks (Jeong et al., 2021), and have achieved relatively good results. CNN has a strong recognition ability mainly because its convolution layers are good at extracting features from data. Therefore, when fusing algorithms, many scholars first use CNN to extract features from the data and then use other classifiers to make decisions. For example, in the hybrid intrusion detection framework based on deep learning proposed by Khan (2021), the role of the convolutional neural network is to obtain local features through convolution. Riyaz and Ganapathy (2020) used the linear correlation coefficient and a conditional random field to obtain data features and then used convolution to extract the optimal features from the feature data for classification.
(1) Data preprocessing based on CNN intrusion detection.
As the most commonly used and fastest-growing algorithm in deep learning, the convolutional neural network is usually used to process two-dimensional or multi-dimensional data such as pictures and speech. On one-dimensional feature data sets it is often inferior to other machine learning algorithms. Faced with this problem, researchers transform the network intrusion detection data stream into two-dimensional data to train the CNN model
(Andresini et al., 2021). For example, Li et al. (2020) divided the intrusion detection data into four parts according to their correlation, transformed the four parts into grayscale images, and fed them into a multi-convolutional neural network for intrusion detection.

Table 3
Summary of intrusion detection based on deep learning.
Li et al. (2020) | Convert intrusion detection data to grayscale. | Implementation of using CNN to process one-dimensional intrusion detection data. | CNN
He et al. (2021) | Use CNN to pre-train on a small number of abnormal behavior samples. | For unbalanced data, the detection accuracy is effectively improved. |
Zhang et al. (2020) | An intrusion detection model based on SGM-CNN. | Through multiple sets of comparative experiments, it is proved that the SGM-CNN model performs well for imbalanced data problems. |
Mulyanto et al. (2021) | A network intrusion detection system based on focal loss. | Reducing the weight of easy-to-classify samples effectively improves the detection rate of minority samples. |
Xiao et al. (2020) | Intrusion detection model based on multi-core CNN. | The use of controlled units in the convolutional layer has experimentally proved that incremental learning can be better achieved. |
Oliveira et al. (2021) | The data is reorganized in chronological order and detected using LSTM. | On the dataset CIDDS-001, the accuracy reaches 99.94%, and the f1-score is 91.66%. | LSTM
Xie et al. (2020) | An HTTP-based Trojan Detection Model via the Hierarchical Spatio-Temporal Features of Traffics. | In detecting HTTP Trojans, HSTF-Model is significantly higher than other algorithms, and the accuracy reaches 99.98% and 91.41% on the BTHT-2018 and ISCX-2012 datasets. |
Hsu et al. (2021) | Robust network intrusion detection scheme using long-short term memory based convolutional neural networks. | 99.68% accuracy on NSL-KDD dataset. |
Yao et al. (2021) | A cross-layer feature-fusion CNN-LSTM-based approach. | Compared with the CNN and LSTM models, the loss of feature data is avoided, and the accuracy is 99.95% and 99.79% on the KDD Cup 99 and NSL-KDD datasets. |
Sun et al. (2020) | Extracting features using CNN-LSTM hybrid network for intrusion detection system. | The accuracy is 98.67% on the CICIDS2017 dataset. |

(2) Research on CNN intrusion detection based on unbalanced data.
As Internet access and connections increase, so do the types of cyberattacks. Analysis of various intrusion detection data sets reveals a significant gap between the numbers of different kinds of attacks, and the numbers of normal and abnormal records also differ. Intrusion detection data therefore has a data imbalance problem, which is one of the main reasons why it is difficult for intrusion detection models to improve their performance. Some scholars have explored the use of CNN to solve the problem of data imbalance.
He et al. (2021) used an autoencoder to select normal samples to train an anomaly detection model. To cope with imbalanced data, CNN is used to pre-train on a small number of abnormal behavior samples. The extracted features are fed into the autoencoder for training, which improves the accuracy of anomaly detection compared with using only the autoencoder trained on normal samples. In this model, CNN is not the final classifier but plays an auxiliary role.
CNN can also be used as the classifier in an intrusion detection model that deals with imbalanced data. Unbalanced data can be handled in two ways: data-level techniques and algorithm-level techniques (Khan et al., 2018). Data-level techniques balance the data before detection, specifically by under-sampling the majority class and over-sampling the minority class.
For example, Zhang et al. (2020) proposed SGM-CNN, a CNN intrusion detection model that combines SMOTE with a Gaussian mixture model-based under-sampling technique for imbalanced data. To test the model's performance, the authors compared different classifiers and data-level balanced sampling techniques on multiple data sets. The experiments show that the SGM-CNN model is the best overall. Algorithm-level techniques do not require balanced preprocessing of the data and focus on a small number of
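The data-level balancing described above (under-sampling the majority class and over-sampling the minority class) can be illustrated with a minimal sketch. It is not the SGM-CNN procedure of Zhang et al. (2020): the over-sampling step is a simplified SMOTE-style interpolation between randomly chosen minority pairs (SMOTE proper interpolates toward nearest neighbours), and the under-sampling step is plain random selection rather than Gaussian mixture clustering. The function names, the toy data, and the target size of 50 are assumptions made for illustration only.

```python
import random


def undersample(majority, target, rng):
    """Randomly keep at most `target` rows of the majority class."""
    return rng.sample(majority, min(target, len(majority)))


def oversample(minority, target, rng):
    """Grow the minority class to `target` rows by interpolating between
    two randomly chosen minority rows (simplified SMOTE-style synthesis)."""
    synthetic = list(minority)
    while len(synthetic) < target:
        a, b = rng.choice(minority), rng.choice(minority)
        t = rng.random()  # position on the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


def balance(normal, attack, target, seed=0):
    """Return (normal, attack) subsets that both contain `target` rows."""
    rng = random.Random(seed)
    return undersample(normal, target, rng), oversample(attack, target, rng)


# Toy data (hypothetical): 1000 "normal" rows and 5 "attack" rows, two features each.
normal = [[i / 1000.0, 0.0] for i in range(1000)]
attack = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.3, 0.7], [0.25, 0.75]]

bal_normal, bal_attack = balance(normal, attack, target=50)
print(len(bal_normal), len(bal_attack))
```

Because every synthetic row lies on a segment between two real minority rows, the new samples stay inside the region already occupied by the minority class; whether that helps a given classifier still has to be checked experimentally.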
most of the literature and are often used to examine the detection performance of the model, which is authoritative.
The KDD 99 dataset was created by simulating a US Air Force LAN environment and was later used publicly in the KDD CUP competition. It includes 5 million network connection records. The experiment uses the 10% training subset it provides (more than 490,000 instances) and the corrected, labeled test set (more than 310,000 instances). Labels are normal and abnormal, and abnormal can be divided into four categories comprising 39 attack types. Since the 10% training subset contains only 22 attack types whereas the test set contains 39, neither the hold-out method nor cross-validation on the training samples can adequately reflect the performance of the algorithm.
The NSL-KDD dataset removes the redundant information and duplicate records in KDD99, so it reflects the detection ability of the algorithm better, but it still cannot reflect a real network. The training set contains more than 120,000 network connection records, and the test set contains more than 20,000.

3.2. Evaluation indicators

Before introducing the evaluation indicators, first consider TP, TN, FP, and FN, as shown in Table 5.
TN: normal data correctly predicted as normal.
FP: normal data incorrectly predicted as abnormal.
FN: abnormal data incorrectly predicted as normal.
TP: abnormal data correctly predicted as abnormal.

Table 5
Confusion matrix.
Actual \ Predicted | normal | abnormal
normal | TN | FP
abnormal | FN | TP

The experiments select accuracy, precision, recall, F1, AUC, and algorithm running time as evaluation indicators.
In intrusion detection, detection accuracy is the percentage of correctly identified normal and abnormal data among all detected data, which represents the overall detection level of the algorithm model. Because it is convenient to calculate, it is one of the most commonly used indices for comparing algorithm models. It is calculated as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision is the percentage of correctly predicted data among all data predicted to be abnormal behavior. The higher the precision, the lower the rate at which the algorithm model misjudges normal behavior data. It is calculated as follows:

$$\text{precision} = \frac{TP}{TP + FP}$$

Recall is the percentage of all abnormal behavior data that are correctly predicted to be abnormal, which represents the ability of the algorithm model to identify abnormal behavior data. The higher the recall, the lower the model's missed detection rate for abnormal data.

$$\text{recall} = \frac{TP}{TP + FN}$$

F1 is an evaluation index for two-class models. Its value is the harmonic mean of precision and recall, which represents the quality of the model to a certain extent.

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

The above evaluation indicators are the most commonly used for evaluating algorithm models, but they still have shortcomings. For example, in the case of unbalanced data, these indicators cannot reflect the predictive ability of the model well.
Therefore, an additional evaluation index, AUC, was added to the experiment. AUC is the area enclosed by the ROC curve and the coordinate axis. In general it lies in [0, 1], with 0.5 corresponding to random guessing, so a useful classifier falls in [0.5, 1]. The closer to 1, the better the model. Since AUC is largely insensitive to class imbalance, it can better reflect the overall detection level of the model.
In addition to the detection ability of the model, its training and testing time also need to be considered. When the detection ability of two models is similar, the shorter the time, the better. A short running time is of great significance for real-time intrusion detection monitoring.

3.3. Data preprocessing

The KDD99 and NSL-KDD datasets contain character data, and the numeric features differ widely in scale because of their different measurement units. Therefore, it is necessary to digitize the character features and normalize the data.
Because the data dimension is high, experiments require more time and space and efficiency is low, so Principal Component Analysis (PCA) is used for dimensionality reduction. To find the appropriate dimension, random forest was used to conduct experiments on the KDD99 and NSL-KDD data (41 dimensions) and on various dimension-reduced versions. The experimental results are shown in Tables 6 and 7. By comparison, the detection effect is best when the KDD99 feature dimension is 12 and when the NSL-KDD feature dimension is 10. To make the experiment more rigorous, algorithms such as Naive Bayes and decision trees were also used for testing, and the data dimensions were determined after comparison.

3.4. Experimental process and analysis

The experimental platform runs Windows 10 with a 2.60 GHz CPU, 8 GB of memory, and an empty 1 TB hard disk; the programming tool is PyCharm 2019.3.3. Decision tree, SVM, Naive Bayes, random forest, and XGBoost in the sklearn library are used for the experiments. The sequential model in the Keras library is used to construct CNN, RNN, and LSTM for the experiments. The experimental models are adjusted starting from the default parameters. The specific parameters and structures are shown in Table 8. The training and test sets of the KDD99 and NSL-KDD intrusion detection data sets described above are used to conduct simulation experiments. The experimental results are shown in Figs. 9–12.
From an overall perspective, Fig. 9 shows the accuracy, F1, and AUC of each model on the KDD99 dataset (because SVM training on 490,000 records takes too long, and the optimal hyperplane may not even be found, the SVM model uses 100,000 records obtained by stratified sampling from the 10% KDD99 data set). It can be seen that the classification effect of the Naive Bayes classifier is poor, and its indicators are the lowest. In contrast, XGBoost has the best overall performance, but its advantage over the other machine learning algorithms is not obvious.
Fig. 10 shows the experimental results on the NSL-KDD dataset. Comparing Figs. 9 and 10, the values of every index in Fig. 10 are significantly lower. Because the NSL-KDD dataset is an improvement of the KDD99 dataset, it removes many redundant duplicate records so that the detection algorithm will not be biased towards data with a high repetition rate. It can be seen from Fig. 10 that the accuracy of the ensemble learning algorithms is lower than their F1, but their AUC is the best. The difference from before is that although the classification effect of Naive Bayes
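The evaluation indicators of Section 3.2 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the code used in the experiments: labels are assumed to be encoded as 1 for abnormal and 0 for normal, the helper names are hypothetical, and AUC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen abnormal sample receives a higher score than a randomly chosen normal one), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), with 1 = abnormal (positive) and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(num, den):
    """Avoid division by zero when a class is never predicted or never present."""
    return num / den if den else 0.0


def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def auc(y_true, scores):
    """Fraction of (abnormal, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 6 normal and 4 abnormal samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.7, 0.9, 0.8, 0.6, 0.4]

result = metrics(y_true, y_pred)
area = auc(y_true, scores)
print(result, area)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a toy example; for data sets of the size used here, a sort-based implementation from an established library would be the practical choice.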
Table 6
Experimental comparison of KDD99 feature dimension.
Dimension 41 25 20 15 12 11 10 9
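The dimension comparison above presupposes the two preprocessing steps named in Section 3.3: digitizing the character features and normalizing the numeric ranges. A minimal sketch of those two steps follows. The paper does not state which encoding or normalization was used, so integer codes and min-max scaling are assumptions here, as are the helper names and the three-column toy records; the PCA step is omitted.

```python
def encode_categorical(rows, column):
    """Replace the string values in `column` with integer codes.
    Codes follow the sorted order of the distinct values."""
    codes = {v: i for i, v in enumerate(sorted({r[column] for r in rows}))}
    encoded = [r[:column] + [codes[r[column]]] + r[column + 1:] for r in rows]
    return encoded, codes


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy records (hypothetical values): [duration, protocol_type, src_bytes]
records = [
    [0, "tcp", 181],
    [2, "udp", 105],
    [0, "icmp", 1032],
    [5, "tcp", 0],
]

encoded, mapping = encode_categorical(records, column=1)
scaled = min_max_normalize(encoded)
print(mapping)
print(scaled)
```

Integer codes impose an artificial order on categories such as the protocol type; one-hot encoding avoids this at the cost of more dimensions, which matters when the next step is dimensionality reduction.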
Table 8
Experimental model parameters.
Model Parameters/Structures
number of iterations. CNN takes 625 s for 20 iterations, RNN takes 321 s for 15 iterations, and LSTM takes 1026 s for 10 iterations.

3.5. Experiment summary

Taking all the data together, ensemble learning performs well in intrusion detection. Although its recall is the lowest on the NSL-KDD data set, its overall indicator AUC is the best, and NSL-KDD does not conform to actual intrusion detection data. The deep learning algorithms are not particularly prominent in this experiment, but their best results depend on the structure, the hyperparameters, and the number of training iterations, which need further in-depth study. Although the decision tree, a traditional machine learning algorithm, performs well, it is inferior to ensemble learning and is therefore gradually being replaced by it. SVM has difficulty processing data sets of large magnitude, so incremental learning needs to be studied for it. Although the Naive Bayes algorithm has low recognition accuracy on learned data, it has obvious advantages over other models when faced with new types of attacks. It also trains faster, so it can be used to detect new types of data.

4. Future research directions

The above analysis and comparison show that machine learning has achieved certain results in network intrusion detection, and many algorithm models have been proposed or put into use. However, current models still have many limitations and face many problems and challenges. Further research is needed, mainly on the following:
(1) How to preprocess when faced with different datasets.
There are many network intrusion detection data sets, and the preprocessing methods for different data sets also differ. For example, in the dimensionality reduction step of the experiment in this paper, there are cases where the dimensions differ only slightly but the detection results differ considerably. NSL-KDD is an optimized version of KDD99. Even though the two datasets have the same dimensions and each dimension has the same meaning, their optimal reduced dimensions are different. Moreover, most network intrusion detection datasets have different dimensions and even different formats. Data preprocessing also includes not only dimensionality reduction but also character digitization, data normalization, and so on, which makes it difficult to unify data preprocessing across different data sets. Therefore, constructing an algorithm model and system that can adaptively find the optimal data preprocessing method also needs further research.
(2) How to build a model that takes into account both small data sets and large data sets.
Algorithmic models such as neural networks and ensemble learning have demonstrated excellent detection capabilities on large data sets. For small data sets, there are also related algorithm studies. For example, Feng et al. (2014) proposed a detection model for small data sets, and the test results were excellent. However, the detection accuracy of models designed for large datasets decreases when they are faced with small datasets, and the training time of models designed for small datasets is significantly prolonged when they are applied to large datasets. Moreover, because network intrusion detection data sets are large in scale, there is little research on network intrusion detection based on small data sets, and models that consider both small and large data sets are even rarer. Therefore, further research is needed in this area.
(3) How to perform multi-classification for imbalanced data.
The problem of unbalanced data has always been a focus and difficulty of network intrusion detection research. There are many types of attacks in the data sets, and their distribution is highly unbalanced. For example, among millions of records, some attack types have only a single-digit number of samples, making it difficult for such small classes to be fully trained. Nowadays, the mainstream method is to increase the proportion of the minority data. Although the detection results improve, the improvement is limited, and further improvement schemes are still needed.
(4) How to complete incremental learning when faced with new categories of data.
At present, the machine learning algorithms that can achieve incremental learning are mainly SVM and ensemble learning, but the accuracy of these algorithms is not high. Some scholars have tried to use deep learning for incremental learning, but the results are unsatisfactory. How to identify a small number of new types of attack data and complete incremental learning has become a difficulty in current research.

5. Conclusions

This paper summarizes the application and research of machine learning in network intrusion detection systems. Comparing and analyzing some common machine learning algorithms used in the intrusion detection field in recent years gives an overview of the characteristics of the different algorithms. In the area of intrusion detection, there have been few studies on traditional machine learning algorithms in recent years, and more research on ensemble learning and deep learning models. Ensemble learning is maturing and is applied in various fields, while deep learning is still at the stage of exploring algorithm models. To further understand the relevant algorithms, this paper uses KDD99 and NSL-KDD as experimental data to conduct experimental research on the algorithms. According to the experimental results, the overall effect of ensemble learning is better, but other algorithms also have their own advantages. In future work, new models can be explored that integrate the advantages of multiple algorithms as far as possible. The experiments in this paper use only KDD series data, and other data and preprocessing methods need to be studied. The experiments perform only binary classification of normal and abnormal data and do not perform multi-class detection. At the same time, on closer division, there are many types of intrusion behaviors, and many new types of data appear in real life. The distribution of these types of intrusions is highly uneven. Therefore, future work still needs to explore the detection capabilities of various machine learning algorithms for unbalanced and new types of data and to find or build good algorithms.

Declaration of Competing Interest

We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

References

Liu, Y.J., 2017. National security strategy and its improvement. Expand. Horiz. 4, 5–10.
National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC) [Internet]. China internet network security monitoring data analysis report in the first half of 2020. https://www.cert.org.cn/publish/main/upload/File/2020Report(2).pdf, 2020 (accessed 15 March 2021).
Wu, Z.H., 2019. Information Security Technology and Practice, 5th ed. Liaoning Science and Technology Publishing House, Shenyang.
Xin, Y., Kong, L.S., Liu, Z., Chen, Y.L., Li, Y.M., Zhu, H.L., et al., 2018. Machine learning and deep learning methods for cybersecurity. IEEE Access 6, 35365–35381. doi:10.1109/ACCESS.2018.2836950.
Gumusbas, D., Yildirim, T., 2022. AI for Cybersecurity: ML-Based Techniques for In- Yao, W., Wang, J., Zhang, S.L., 2015. Intrusion detection model based on decision
trusion Detection Systems. Advances in Machine Learning/Deep Learning-based tree and Naïve-Bayes classification. J. Comput. Appl. 35 (10), 2883–2885.
Technologies, pp. 117–140. doi:10.1007/978- 3- 030- 76794- 5_7. Wang, H., Chen, H.Y., Liu, S.F., 2014. Intrusion detection system based on improved
Handa, A., Sharma, A., Shukla, S.K., 2019. Machine learning in cybersecurity: a re- Naïve Bayesian algorithm. Comput. Sci. 41 (04), 111 -115+119.
view. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9 (4), e1306. doi:10.1002/ Zhang, B., Liu, Z.Y., Jia, Y.G., Ren, J.D., Zhao, X.L., 2018. Network intrusion detection
widm.1306. method based on PCA and Bayes algorithm. Secur. Commun. Netw. 2018 (10),
Kolandaisamy, R., Noor, R.M., Kolandaisamy, I., et al., 2020. A stream position perfor- 1–11.
mance analysis model based on DDoS attack detection for cluster-based routing Gu, J., Lu, S., 2021. An effective intrusion detection approach using SVM with naïve
in VANET. J. Ambient Intell. Humaniz. Comput. 6, 1–14. Bayes feature embedding. Comput. Secur. 103, 102158. doi:10.1016/j.cose.2020.
Williamson, M.M., 2003. Resilient infrastructure for network security. Wiley Subscr. 102158.
Serv. Inc. A Wiley Co. 9 (2), 34–40. Breiman, L., 2001. Random forest. Mach. Learn. 45, 5–32.
J.P. Anderson Computer security threat monitoring and surveillance. 1980. Hammad, M., Hewahi, N., Elmedany, W., 2021. TNERF: a novel high accuracy ma-
T.F. Lunt, R. Jagannathan, R. Lee, S. Listgarten, D.L. Edwards, P.G. Neumann, et al. chine learning approach for Intrusion detection systems. IET Inf. Secur. 15 (2),
IDES: the enhanced prototype AReal-time intrusion-detection expert system. 178–190.
1988. doi:10.13140/RG.2.1.3905.1685. Iwendi, C., Khan, S., Anajemba, J.H., Mittal, M., Alenezi, M., Alazab, M., 2020. The
Denning, DE., 1987. An intrusion-detection model. IEEE Trans. Softw. Eng. 13 (2), use of ensemble models for multiple class and binary class classification for
222–232. improving intrusion detection systems. Sensors 20 (9), 2559.
C. Kahn, P.A. Porras, S.S. Chen, B. Tung A common intrusion detection framework. Boahen, E.K., Elvire, B., Wang, C., 2021. Network anomaly detection in a con-
Position Paper of Information Survivability Workshop. 1998. trolled environment based on an enhanced PSOGSARFC. Comput. Secur. 104 (4),
Otoum, Y., Nayak, A., 2021. AS-IDS: anomaly and signature based IDS for the inter- 102225.
net of things. J. Netw. Syst. Manag. 29 (23). doi:10.1007/s10922- 021- 09589- 6. Nazir, A., Khan, RA., 2020. A novel combinatorial optimization based feature selec-
Baig, Z., Salah, K., 2016. Distributed hierarchical pattern-matching for network in- tion method for network intrusion detection. Comput. Secur. 102, 102164.
trusion detection. J. Internet Technol. 17 (2), 167–178. doi:10.6138/JIT.2016.17.2. Mokhtari, S., Abbaspour, A., Yen, K.K., Sargolzaei, A.A, 2021. Machine learning ap-
20131021. proach for anomaly detection in industrial control systems based on measure-
Benferhat, S., Boudjelida, A., Tabia, K., Drias, H., 2013. An intrusion detection and ment data. Electronics 10 (4), 407.
alert correlation approach based on revising probabilistic classifiers using expert Ghaleb, F.A., Saeed, F., Al-Sarem, M., Al-rimy, B.A.S., Boulila, W., Eljialy, A.E.M.,
knowledge. Appl. Intell. 38 (4), 520–540. doi:10.1007/s10489- 012- 0383- 7. et al., 2020. Misbehavior-aware on-demand collaborative intrusion detection
Ilgun, K., Kemmerer, RA., 1995. State transition analysis: a rule-based intrusion de- system using distributed ensemble learning for VANET. Electronics 9 (9),
tection approach. IEEE Trans. Softw. Eng. 21 (3), 181–199. 1411.
Yin, L.B., 2019. National industrial information security development research cen- Upadhyay, D., Manero, J., Zaman, M., Sampalli, S., 2021. Gradient boosting feature se-
ter. Decoding: Industrial Cyber Security, 1th ed. Publishing House of electronics lection with machine learning classifiers for intrusion detection on power grids.
industry, Beijing. IEEE Trans. Netw. Serv. Manag. 18 (1), 1104–1116.
Jiang, J.C., Ma, H.T., Ren, D.E., Qing, S.H., 2000. A survey of intrusion detection re- Mishra, P., Varadharajan, V., Pilli, E., Tupakula, U., 2020. VMGuard: a VMI-based
search on network security. J. Softw. 11, 1460–1466. security architecture for intrusion detection in cloud environment. IEEE Trans.
Luca, B., Marco, C., Mario, M., Enrico, M., Talha, N., Sandro, Z., 2017. Statistical finger- Cloud Comput. 8 (3), 957–971.
print-based intrusion detection system (SF-IDS). Int. J. Commun. Syst. 30 (10), Morfino, V., Rampone, S., 2020. Towards near-real-time intrusion detection for IoT
1–11. devices using supervised learning and apache spark. Electronics 9 (3), 444.
Nassif, A.B., Talib, M.A., Nasir, Q., Dakalbab, F.M., 2021. Machine learning for anomaly Amouri, A., Alaparthy, V.T., Morgera, S.D., 2020. A machine learning based intrusion
detection: a systematic review. IEEE Access 9, 78658–78700. detection system for mobile internet of things. Sensors 20 (2), 461.
Sun, R., Zhang, S., Yin, C., Wang, J., Minet, S., 2019. Strategies for data stream mining Karthik, M.G., Krishnan, M., 2021. Hybrid random forest and synthetic minority over
method applied in anomaly detection. Clust. Comput. 22, 399–408. sampling technique for detecting internet of things attacks. J. Ambient Intell.
Cañete-Sifuentes, L., Monroy, R., Medina-Pérez, M.A., 2021. A review and experimen- Humaniz. Comput. doi:10.1007/s12652- 021- 03082- 3.
tal comparison of multivariate decision trees. IEEE Access 9, 110451–110479. Karthikraja, C., Senthilkumar, J., Hariharan, R., Devi, G.U., Suresh, Y., Mohanraj, V.,
Hota, H.S., Shrivas, A.K., 2014. Decision tree techniques applied on NSL-KDD data 2022. An empirical intrusion detection system based on XGBoost and bidirec-
and its comparison with various feature selection techniques. Adv. Comput. tional long-short term model for 5G and other telecommunication technologies.
Netw. Inform. 1, 205–211. Comput. Intell. doi:10.1111/coin.12497.
Bagyalakshmi, C., Samundeeswari, E.S., 2020. DDoS attack classification on cloud en- Zhang, W.G., Zhang, R.H., Wu, C.Z., Goh, A.T.C., Lacasse, S., Liu, Z.Q., et al., 2020.
vironment using machine learning techniques with different feature selection State-of-the-art review of soft computing applications in underground excava-
methods. Int. J. Adv. Trends Comput. Sci. Eng. 9 (5), 7301–7308. tions. Geosci. Front. 11 (04), 1095–1106.
Umak, M.R., Mishra, R., 2014. An efficient modular approach of intrusion detection Bhattacharya, S., Krishnan, S.S.R., Maddikunta, P.K.R., Kaluri, R., Singh, S.,
system based on MSPSO-DT. Int. J. Adv. Res. Comput. Sci. 5 (3), 47–53. Gadekallu, T.R., et al., 2020. A novel PCA-firefly based XGBoost classification
Mahbooba, B., Timilsina, M., Sahal, R., Serrano, M., 2021. Explainable artificial intelli- model for intrusion detection in networks using GPU. Electronics 9 (2), 219.
gence (XAI) to enhance trust management in intrusion detection systems using Wang, X.L., Lu, X., 2020. A host-based anomaly detection framework using XGBoost
decision Tree model. Complex. 2021, 1–11. and LSTM for IoT devices. Wirel. Commun. Mob. Comput. 2020, 1–13.
Hearst, M.A., Dumais, S.T., Osman, E., Platt, J., Scholkopf, B., 1998. Support vector Bedi, P., Gupta, N., Jindal, V., 2021. I-SiamIDS: an improved Siam-IDS for handling
machines. IEEE Intell. Syst. Their Appl. 13 (4), 18–28. class imbalance in network-based intrusion detection systems. Appl. Intell. 51
Pan, Y.Q., Zhai, W.P., Gao, W., Shen, X.J., 2020. If-SVM: iterative factoring support (2), 1133–1151.
vector machine. Multimed. Tools Appl. 79 (35-36), 25441–25461. Qiao, N., Li, Z.X., Zhao, GS., 2022. Intrusion detection model of internet of
Chauhan, V.K., Dahiya, K., Sharma, A., 2019. Problem formulations and solvers in things based on XGBoost-RF. J. Chin. Mini Micro Comput. Syst. 43 (01), 152–
linear SVM: a review. Artif. Intell. Rev. 52 (2), 803–855. 158.
Meddeb, R., Jemili, F., Triki, B., Korbaa, O., 2019. Anomaly-based behavioral detection Kumar, P., Gupta, G.P., Tripathi, R., 2021. An ensemble learning and fog-cloud ar-
in mobile Ad-Hoc networks. Procedia Comput. Sci. 159, 77–86. chitecture-driven cyber-attack detection framework for IoMT networks. Comput.
Al-Qatf, M., Yu, L., Al-Habib, M., Al-Sabahi, K., 2018. Deep learning approach combin- Commun. 166, 110–124.
ing sparse autoencoder with SVM for network intrusion detection. IEEE Access Xu, M.F., Li, X.H., Wang, Y.W., Luo, B., Guo, J.J., 2020. Privacy-preserving multi-
6, 52843–52856. source transfer learning in intrusion detection system. Trans. Emerg. Telecom-
Shen, K., Parvin, H., Qasem, S.N., Tuan, B.A., Pho, K.H., 2020. A classification model mun. Technol. 32 (5), e3957.
based on SVM and fuzzy rough set for network intrusion detection. J. Intell. Abu Al-Haija, Q., Zein-Sabatto, S., 2020. An efficient deep-learning-based detec-
Fuzzy Syst. 39 (1), 1–17. tion and classification system for cyber-attacks in IoT communication networks.
Wang, S., Jin, Z.G., 2020. IDS classification algorithm based on fuzzy SVM model. Electronics 9 (12), 2152.
Appl. Res. Comput. 37 (02), 187–190. Zhou, X.K., Liang, W., Shimizu, S., Ma, J.H., Jin, Q., 2021. Siamese neural network
Feng, W.Y., Zhang, Q.L., Hu, G.Z., Huang, J.X.J., 2014. Mining network data for intru- based few-shot learning for anomaly detection in industrial cyber-physical sys-
sion detection through combining SVMs with ant colony networks. Futur. Gener. tems. IEEE Trans. Ind. Inf. 17 (8), 5790–5798.
Comput. Syst. 37, 127–140 Int. J. Escience. van Wyk, F., Wang, Y.Y., Khojandi, A., Masoud, N., 2020. Real-time sensor anomaly
Kabir, E., Hu, J.K., Wang, H., Zhuo, GP., 2018. A novel statistical technique for in- detection and identification in automated vehicles. IEEE Trans. Intell. Transp.
trusion detection systems. Futur. Gener. Comput. Syst. 79, 303–318 The Interna- Syst. 21 (3), 1264–1276.
tional Journal of Escience. Nie, L.S., Ning, Z.L., Wang, X.J., Hu, X.P., Cheng, J., Li, YK., 2020. Data-driven intru-
Pozi, M.S.M., Sulaiman, M.N., Mustapha, N., Perumal, T., 2015. Improving anomalous sion detection for intelligent internet of vehicles: a deep convolutional neural
rare attack detection rate for intrusion detection system using support vector network-based method. IEEE Trans. Netw. Sci. Eng. 7 (4), 2219–2230.
machine and genetic programming. Neural Process. Lett. 44 (2), 1–12. Jeong, S., Jeon, B., Chung, B., Kim, H.K., 2021. Convolutional neural network-based
Zhu, J., Hu, W.B., 2015. Recent advances in Bayesian machine learning. J. Comput. intrusion detection system for AVTP streams in automotive Ethernet-based net-
Res. Dev. 52 (01), 16–26. works. Veh. Commun. 29, 100338.
Koch, I., Naito, K., Tanaka, H., 2019. Kernel naive Bayes discrimination for high-di- Khan, MA., 2021. HCRNNIDS: hybrid convolutional recurrent neural network-based
mensional pattern recognition. Aust. N. Z. J. Stat. 61 (4), 401–428. network intrusion detection system. Processes 9 (5), 834.
Zhang, H.P., Cheng, N., Zhang, Y., Li, Z.B., 2021. Label flipping attacks against Naive Riyaz, B., Ganapathy, S., 2020. A deep learning approach for effective intrusion de-
Bayes on spam filtering systems. Appl. Intell. 51 (7), 4503–4514. tection in wireless networks using CNN. Soft Comput. 24 (22), 17265–17278.
C. Zhang, D. Jia, L. Wang et al. Computers & Security 121 (2022) 102861
Dr. Chunying Zhang received the BS from the School of Computer Science and Technology at Heilongjiang University, and the MS and Ph.D. in Computer Application Technology from the School of Information Science and Engineering at Yanshan University. She is currently a professor at the College of Science, North China University of Science and Technology. She is a member of the China Computer Federation (CCF). Her research interests include data mining, rough sets, and social network analysis.

Dr. Aimin Yang is currently a professor at the College of Science, North China University of Science and Technology. His research interests include intelligent algorithm design, steel big data and metallurgical intelligent manufacturing, medical big data and AI-assisted diagnosis and treatment system development, and ship multi-attribute intelligent decision-making.