LBDMIDS LSTM Based Deep Learning Model F
Abstract—In recent years, we have witnessed a huge growth in the number of Internet of Things (IoT) and edge devices being used in our everyday activities. This demands that the security of these devices against cyber attacks be improved to protect their users. For years, Machine Learning (ML) techniques have been used to develop Network Intrusion Detection Systems (NIDS) with the aim of increasing their reliability and robustness. Among the earlier ML techniques, Decision Tree (DT) performed well. In recent years, Deep Learning (DL) techniques have been used in an attempt to build more reliable systems. In this paper, a Deep Learning enabled Long Short-Term Memory (LSTM) Autoencoder and a 13-feature Deep Neural Network (DNN) model were developed, which performed considerably better in terms of accuracy on the UNSW-NB15 and Bot-IoT datasets. Hence we propose LBDMIDS, where we develop NIDS models based on two variants of LSTM, namely stacked LSTM and bidirectional LSTM, and validate their performance on the UNSW-NB15 and Bot-IoT datasets. This paper concludes that these variants in LBDMIDS outperform classic ML techniques and perform similarly to the DNN models that have been suggested in the past.

Index Terms—IoT Security, Intrusion, IDS, LSTM, Deep Learning

I. INTRODUCTION

The Internet of Things (IoT) is a collection of devices which gather and share data over the Internet. Data is gathered using sensors, which are embedded in these devices. IoT devices are used in many areas, ranging from household devices like smart watches, smart bulbs, smart air conditioners and temperature sensors to more complex devices like smart vehicles and smart electrical grids. They are also extensively used in manufacturing, transportation, infrastructure, military equipment and healthcare. It was estimated that over 35 billion IoT devices would be in use by 2021 [1]. This high number has contributed to an increasing number of cyber attacks on IoT networks, which demands that the security of these devices be increased, as the current security measures have proven to be inadequate [2].

Network Intrusion Detection Systems (NIDS) are used to detect cyber attacks and malicious activities like Denial of Service (DoS), Distributed DoS (DDoS), Worms, Backdoor, etc. Network traffic is monitored and potential threats are identified. Signature-based NIDS are good at detecting known attacks but fail at detecting attacks which have not been seen before. Anomaly-based NIDS are good at detecting new attacks, as they focus more on attack patterns, but give high false positives.

Therefore, NIDS integrated with ML and DL techniques have been developed to better identify new threats [3]. Earlier, ML and DL methods were applied widely to develop systems capable of detecting intrusion in networks for conventional and extensive communication. However, every method came with limitations of its own, and the need for a better system gradually increased as the classical models failed to handle the heterogeneity of data. Also, the dataset used should contain multiple types of attack vectors (like DoS, DDoS, Worms) to tackle multi-class classification of attack categories. Traditional models are unable to detect zero-day attacks, as the dataset cannot hold entries which are new to the attack analysis.

So, a smart model is needed which can detect any anomaly that arises from a deviation from the normal behavior of the associated network. Such a model could also detect zero-day attacks, as it would not depend solely on the pre-built classes of attacks. However, this in turn introduces a bias towards the general class, which leads to raised false positive rates. To obtain an intermediate model that behaves more sustainably, DL needs to be employed.

Classic ML techniques have been used in this field for 20+ years, with the KDD99 dataset in consideration. But, with the onset of the technological boom, the attack categories are increasing rapidly. So, the potential of any classic ML model is far smaller than the reach of the intrusion. Methods like SVM (Support Vector Machine), DT (Decision Tree), KNN (K-Nearest Neighbor), etc. were therefore ruled out long ago for industrial applications. Also, hybrid methods using these models worked well for a long time, but eventually fall short of recent DL models. ML models rest essentially on supervised and unsupervised learning techniques, apart from reinforcement learning. ML depends heavily on how rich the available dataset is. The inability to scale the data accordingly also poses a large limitation to the extensive use of any ML model.

To handle such issues, DL methods like ANN (Artificial Neural Networks) came into the picture. Here the model is built on neurons which are coordinated by parameters and hyper-parameters. To scale the input and use it on an extensive scale, the number of layers is chosen accordingly to get maximum efficiency. DL methods have proved far better than the ML models in terms of accuracy, precision, etc., with the ability to handle large amounts of data. ANN, CNN (Convolutional Neural Networks), RNN (Recurrent Neural Networks) and FDN (Feed Forward Deep Networks) are some examples of DL architectures. DL techniques in general have outperformed ML techniques.

As research progressed, it was seen that basic DL architectures lacked the ability to detect unknown attacks and, even if they did, the false positives and false negatives were very high. To tackle this problem, LSTM (Long Short-Term Memory) was introduced. This methodology is a special form of RNN. The most distinguishing feature of LSTM is that it keeps information/parameters for later use in the system. Thus, it can handle time-series data that varies with respect to time.

The primary motivation for using LSTM lies in its approach, where it is not restricted to the limitations of conventional DL (neural network) methods. Here, the input sequences and output sequences between layers are variable, which could work efficiently to detect known as well as unknown attacks.

The paper is organized as follows: Section II contains related research which demonstrates the earlier work done, Section III describes the datasets used in our experiments, Section IV includes our proposed methodology, Section V contains the results and performance analysis, and Section VI gives the conclusion and future works.

II. LITERATURE REVIEW

In paper [4], an IDS was developed to learn the behavior of normal network traffic. The dataset used was UNSW-NB15, for communication of external networks. The methods experimented with were Artificial Immune System (AIS), Filtered-based SVM (FSVM), Euclidean Distance Map (EDM) and Geometric Area Analysis (GAA), which achieved 85%, 92%, 90% and 93% respectively. In [5], the Bot-IoT and UNSW-NB15 datasets are considered in the experiment. After the data standardization is done, the MLP (Multi-Layered Perceptron) model is used, followed by adapting the hyperparameters. The comparison was based on the Bot-IoT dataset while the classifier was built on the UNSW-NB15 dataset. ARM (Association Rule Mining) and Naïve Bayes only had 85% and 72% accuracy respectively. The normal perceptron model was basic and gave 63% accuracy, but when converted to the prima facie 13-feature DNN model, it gave 99% accuracy. Among the ML techniques, Decision Tree gave the highest accuracy (93%). The process of training the models consumes a lot of time and memory [5].

In paper [6], the authors explore an ML-based approach. On carrying out an experiment, it was seen that the ML methods along with flow identifiers were effective in detecting botnet attacks. Four ML algorithms were used, namely ARM, Artificial Neural Network (ANN), Naïve Bayes and Decision Tree. Decision Tree gave the best results, with the highest accuracy of 93.23% and the lowest False Alarm Rate of 6.77%, whereas ANN was the least accurate, with an accuracy of 63.97% and a False Alarm Rate of 36.03%. In [7], the authors addressed the lack of a dataset which can appropriately reflect modern network traffic and attacks. That paper therefore presents the creation of the UNSW-NB15 dataset. It has 49 features developed using the Argus and Bro-IDS tools.

In paper [8], the authors proposed a hybrid IDS for IoT networks by combining CNN (Convolutional Neural Network) and LSTM. They used the UNSW-NB15 dataset and compared the performances of the hybrid IDS and RNN for binary classification; the accuracies achieved were 95.7% and 98.7% respectively. In [9], the authors compared the performances of DL methods like RNN, GRU and Text-CNN with some of the traditional ML methods on the KDD99 and ADFA-LD datasets. The DL methods were able to outperform the traditional ML techniques. In paper [10], the authors implemented Linear Discriminant Analysis (LDA), Classification and Regression Tree (CART) and Random Forest (RF) on the KDD99 dataset. The accuracies achieved were 98.1%, 98% and 99.65% respectively.

In [11], the authors implemented a Feed-Forward Neural Network for binary and multi-class classification on the Bot-IoT dataset for normal traffic and three different attack categories (DDoS/DoS, Reconnaissance, Information Theft). They were able to achieve 98%, 99.4%, 98.4% and 88.9% accuracy on the normal and the three attack categories respectively. In paper [12], the authors proposed an IDS based on blockchain and deep learning. The DNN model was able to achieve 98% accuracy for binary classification and 97% accuracy for multi-class classification on the NSL-KDD dataset.

III. DATASET EXPLANATION

Data selection is necessary in order to train our models and to determine their reliability in the testing phase. In the past, datasets like KDD98, KDDCUP99 and NSLKDD were used as comprehensive datasets for Network Intrusion Detection Systems (NIDS). Recent research has shown that these datasets do not reflect modern network traffic (normal and attack vectors). Hence, we selected a few datasets that were made publicly available by researchers in the last few years. These datasets contain labelled network data that were generated in labs with the help of a virtual network setup. The datasets are a hybrid of normal network traffic and synthetic botnet attack traffic. The datasets selected are:

A. UNSW-NB15 Dataset

Widely in use since 2015 as a dataset for evaluating NIDS, it was created by the Cyber Range Lab of UNSW-Canberra [14-17]. The dataset contains 49 features (excluding class labels) and a total of 2,540,044 records that were split across 4 CSV files. The dataset contains a total of 10 class labels, out of which there are 9 types of botnet attack vectors and 1 Normal class. The different types of attack traffic are as follows: analysis, fuzzers, backdoor, generic, Denial of Service (DoS), reconnaissance, shellcode, exploits and worms [13]. Out of the 49 features, 13 important ones were selected in the paper [4], namely source IP address, source port, destination IP address, destination port, duration, source bytes, destination bytes, source TTL, destination TTL, source load, destination load, source packets and destination packets.
interacting layers which work in a special way to remember the important information for a long duration of time.

The vanishing gradient problem faced by RNNs is also solved by LSTMs, as they continue to learn new information while keeping the previous information. Hence the significance of parameters is maintained and the model becomes stable.

LSTMs also have different variations. Although the differences between them are small, they can be of great significance depending upon the input data fed to the network. Common types of LSTM are Classic, Stacked and Bidirectional. Stacked LSTM has several LSTM layers and can only access past samples. Bi-directional LSTM has two layers and has access to both past and future samples.

B. Network Intrusion Detection System (NIDS) Architecture

We propose NIDS based on two variants of LSTM, Stacked LSTM and Bi-directional LSTM, to help separate normal and attack vectors in IoT network traffic. The workflow of the NIDS architecture can be divided into three phases, namely Data Preprocessing, Training of the Model and Model Validation, as shown in Fig. 1.

C. Data Preprocessing

In the Data Preprocessing stage, the raw data extracted from IoT networks in the form of PCAP/CSV files is processed into a form suitable to be fed to the LSTM model. The data is filtered to remove any redundancies and to get rid of null values. Afterwards, the most important features are selected to be fed into the proposed model, followed by z-score normalization of the features, which ensures similar distributions for each feature.

1) Feature Extraction: We reduced the dimensionality of the raw data by selecting the most important features, which made the data suitable for processing. Very often, large datasets have a lot of redundant and correlated data that can be filtered out without losing important or relevant information. In the case of UNSW-NB15, we selected the following 13 most important features [1]:
1) Source ip address - IP address of the attacker computer
2) Source port - Port number of the attacker computer
3) Destination ip address - IP address of the victim computer
4) Destination port - Port number of the victim computer
5) Duration - Record total duration of transaction
6) Source bytes - Number of bytes sent from source to the destination
7) Destination bytes - Number of bytes sent from destination to the source
8) Source TTL - Source to destination time to live value
9) Destination TTL - Destination to source time to live value
10) Source load - Transmission rate in bits per second
11) Destination load - Reception rate in bits per second
12) Source packets - Number of packets sent from source to the destination
13) Destination packets - Number of packets sent from destination to the source

For the Bot-IoT dataset, the 10 features that were pre-selected in the reduced training and testing set were:
1) rate - Total packets per second in transaction
2) srate - Source-to-destination packets per second
3) drate - Destination-to-source packets per second
4) min - Minimum duration of aggregated records
5) max - Maximum duration of aggregated records
6) mean - Average duration of aggregated records
7) std dev - Standard deviation of aggregated records
8) state number - Numerical representation of feature state
9) flgs number - Numerical representation of feature flags
10) seq - Argus sequence number

2) z-scale Normalization: Normalization transforms the data so that all features are distributed similarly. It helps the model to treat each feature with similar importance, as it provides similar weights to each feature. Assuming the feature subspace has N rows and M columns, i.e., $X \in \mathbb{R}^{N \times M}$, the z-scale normalization can be implemented as follows:

$$\mu_m = \frac{1}{N} \sum_{i=0}^{N-1} x_{im} \tag{1}$$

$$\sigma_m = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} (x_{im} - \mu_m)^2} \tag{2}$$

The z-scale normalized feature vector can be obtained as

$$z_m = \frac{X_m - \mu_m}{\sigma_m} \tag{3}$$

Here, $\mu_m$ is the mean of the entries of the m'th column and $\sigma_m$ is the standard deviation of the entries of the m'th column.

Algorithm 1 Z-SCALE NORMALIZATION
for each col m in X (0, 1, 2, ..., M-1)
    µ_m ← compute (1)
    σ_m ← compute (2)
    z_m ← compute (3)
end for

D. Training and Validation Process

The Data Preprocessing stage transforms the data into a form more suitable for the model, which leads to more accurate predictions. It is followed by changing the shape of the data to be processed by the LSTM layers. A suitable Timesteps parameter is selected and the dataset is reshaped to (samples, timesteps, features). Timesteps is the number of past samples that the LSTM model looks back at. The training and validation period of the model consists of feeding the time-series sequential data to the LSTM layers.

1) Stacked LSTM: In the case of stacked LSTM, there are multiple layers of LSTM stacked on top of one another. The LSTM layers help in uncovering the patterns and the dependence of features on their class labels because they can learn at higher levels of abstraction. The input LSTM layer is followed by a batch of hidden LSTM layers that process sequenced input and combine learning patterns from the previous layers, to produce learning representations at higher levels of abstraction. The Dense layer is the final layer, with the number of nodes equal to the number of categories in the output label. In the decision phase, the soft-max activation function [19] is used by the Dense layer to select the most probable of the output classes, and the prediction error is calculated with the help of 'Sparse Categorical Crossentropy', which is then backpropagated to adjust the weights of the neural network. The model hyperparameters are shown in Table I.

TABLE I
Model hyper-parameters for Stacked LSTM

Dataset      No. of LSTM Layers   No. of Cells/Layer   Epochs   Learning Rate
UNSW-NB15    4                    40, 128, 128, 64     50       0.002
Bot-IoT      2                    32, 32               5        0.002

2) Bi-Directional LSTM: In the case of Bi-Directional LSTMs, the recurrent network layer is replicated and the copy works alongside the first layer. The first layer processes the input sequence, while the reversed copy of the input sequence is fed to the second layer. Learning from both the past instances and the future instances provides more context to the network and results in better learning. The input layer is a Bi-Directional LSTM layer which feeds forward a non-sequential output to a Dense layer. The Dense layer is the final layer, with the number of nodes equal to the number of categories in the output label. In the decision phase, the soft-max activation function is used by the Dense layer to select the most probable of the output classes, and the prediction error is calculated with the help of 'Sparse Categorical Crossentropy', which is then backpropagated to adjust the weights of the neural network. The model hyperparameters are shown in Table II.

TABLE II
Model hyper-parameters for Bi-LSTM

Dataset      No. of LSTM Layers   No. of Cells/Layer   Epochs   Learning Rate
UNSW-NB15    1                    64                   50       0.0015
Bot-IoT      1                    12                   5        0.001

3) Model Structure and Parameters: The model is trained over a number of epochs, and the training and validation loss decreases gradually with time. The learning process is stopped when the number of epochs reaches the maximum limit or the model starts to overfit the training dataset.

V. RESULTS AND PERFORMANCE ANALYSIS

A. Experimental Setup

We used Google Colab's GPU. The specifications were Intel(R) Xeon(R) CPU with 2 [email protected] GHz, 12.7 GB of RAM and 78 GB of hard disk space. The version of Python installed was 3.7.12 and that of TensorFlow was 2.7.0.

B. Evaluation Metrics

There is no single metric which can accurately tell how good a particular model is. Hence, we have used several metrics to evaluate the DL models:
• True Positive (TP): Number of attack samples correctly predicted as attack.
• False Positive (FP): Number of normal samples falsely predicted as attack.
• True Negative (TN): Number of normal samples correctly predicted as normal.
• False Negative (FN): Number of attack samples falsely predicted as normal.
• Accuracy: Ratio of correctly predicted samples to total samples.

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

• Precision: Ratio of correctly predicted attack samples to total predicted attack samples.

$$PR = \frac{TP}{TP + FP} \tag{5}$$

• Recall: Ratio of correctly predicted attack samples to total number of attack samples.

$$RE = \frac{TP}{TP + FN} \tag{6}$$

• F1 Score: Harmonic mean of precision and recall.

$$F1\ Score = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision} \tag{7}$$

• Weighted avg: Average of the per-class scores in which classes with a higher frequency contribute more than others.

TABLE III
Stacked LSTM

Attack           Precision   Recall   F1
Normal           0.98        1.00     0.99
Exploits         0.56        0.86     0.67
Reconnaissance   0.79        0.51     0.62
DoS              0.87        0.01     0.01
Generic          1.00        0.98     0.99
Shellcode        0.62        0.41     0.49
Fuzzers          0.52        0.29     0.37
Worms            1.00        0.00     0.00
Backdoor         1.00        0.00     0.00
Analysis         1.00        0.00     0.00
Weighted avg     0.97        0.96     0.96

C. Performance Analysis on UNSW-NB15 Dataset

The dataset was split into 75% for training and 25% for validation. The training phase consisted of 50 epochs that lasted over 5 hours. In the case of Stacked LSTM, the validation phase took 128 seconds with a processing speed of 0.2 ms/sample, and for Bidirectional LSTM, the validation phase took 90 seconds with a processing speed of 0.14 ms/sample. The accuracy achieved by Stacked LSTM was 96.60% and the accuracy achieved by Bi-directional LSTM was 96.41%.
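As a rough consistency check on the reported validation timings, the short Python sketch below multiplies the per-sample processing speed by an estimated validation-set size. It assumes that the validation set is exactly 25% of the 2,540,044 UNSW-NB15 records, which ignores any rows dropped during preprocessing; the function name `validation_seconds` is only illustrative.

```python
# Assumption: the validation set is 25% of all 2,540,044 UNSW-NB15 records
# (rows removed during preprocessing are not accounted for).
TOTAL_RECORDS = 2_540_044
VALIDATION_FRACTION = 0.25


def validation_seconds(ms_per_sample: float, n_samples: int) -> float:
    """Total validation time in seconds for a given per-sample cost in ms."""
    return n_samples * ms_per_sample / 1000.0


n_val = round(TOTAL_RECORDS * VALIDATION_FRACTION)
stacked_seconds = validation_seconds(0.2, n_val)    # Stacked LSTM speed
bidir_seconds = validation_seconds(0.14, n_val)     # Bi-directional LSTM speed

print(f"validation samples: {n_val}")
print(f"Stacked LSTM: about {stacked_seconds:.0f} s")
print(f"Bi-directional LSTM: about {bidir_seconds:.0f} s")
```

Under this assumption the estimates come out at roughly 127 s and 89 s, which is in line with the reported 128 s and 90 s.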
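The per-class precision, recall and F1 values and the weighted average reported in the results tables can be computed from true and predicted labels in a one-vs-rest fashion. The following is a minimal pure-Python sketch under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and a score whose denominator is zero is set to 0.0, which is a convention chosen here and may differ from the one used to produce the tables.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from one-vs-rest counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: undefined ratios (zero denominator) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_average(scores):
    """Average precision, recall and F1, weighting each class by its support."""
    total = sum(s[3] for s in scores.values())
    return tuple(
        sum(s[i] * s[3] for s in scores.values()) / total for i in range(3)
    )


# Toy labels for illustration only; they are not taken from the datasets.
y_true = ["normal", "normal", "dos", "dos", "worms"]
y_pred = ["normal", "dos", "dos", "dos", "normal"]
scores = per_class_scores(y_true, y_pred)
print(scores)
print(accuracy(y_true, y_pred), weighted_average(scores))
```

Because the weights are the class supports, a rare class such as Worms contributes little to the weighted average even when its recall is zero, which is why a high weighted score can coexist with very low per-class recall.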
TABLE IV
Bi-directional LSTM
TABLE V
Stacked LSTM & Bi-directional LSTM
TABLE VI
Comparison of Results on BoT-IoT and UNSW NB-15 Datasets
REFERENCES
[1] Jack Steward, The Ultimate List of Internet of Things Statistics for 2022.
URL https://findstack.com/internet-of-things-statistics/
[2] M. Nawir, A. Amir, N. Yaakob, O.B. Lynn, Internet of things(iot):
Taxonomy of security attacks, in: 2016 3rd International Conference on
Electronic Design (ICED), IEEE, 2016, pp. 321–326.