
Intrusion Detection Using Artificial Intelligence on KDD Data Set

With the Aid of Neural Network and Statistical Analysis

Raghvendra Giri
Computer Engineering and Media
De Montfort University
Leicester, United Kingdom
[email protected]

Abstract—Developing an efficient Intrusion Detection System (IDS) has attracted many researchers seeking to detect novel network attacks. Data is increasing exponentially, and with new stringent global policies on protecting and sharing data in place, it is vital to protect organizations and small-scale businesses such as hospitals, power grids and healthcare grids. Here we use the KDDCUP'99 data set for the evaluation of intrusion detection, examine the problems of the KDDCUP'99 dataset, and propose the development of a novel, efficient IDS with the aid of Artificial Intelligence.

Keywords—Neural Network, KDDCUP'99 data set, Intrusion Detection (IDS), Artificial Intelligence.

Software Tools Used – 1) MATLAB: code for normalization, processing/loading/transformation of data, and designing the Neural Network topology. 2) R: statistical analysis, data-trend observation, Principal Component Analysis, Random Forest, data analytics and visualization. 3) WEKA: classification and attribute selection based on average impurity decrease, Random Forest, PCA.

I. INTRODUCTION

With the exponential growth of technological advances, the wider penetration of internet-enabled mobile phones, and access to information sharing on social media, data is growing exponentially: it was estimated that the amount of data collected from the dawn of the century until 2003 is now generated every year, and the volume continues to grow rapidly. Many businesses growing globally rely on technology to share data over established networks, and as a business grows it becomes important to protect its data, especially sensitive data that might be leaked or captured by competitors, or healthcare data, which is highly sensitive; many policies have evolved to protect data sharing. To protect this data, the development of an efficient intrusion detection system is vital. Earlier, with the aid of statistical methods, data was visualised to predict intrusion trends, but the development of efficient high-performance hardware (GPUs, multiprocessors, quantum computing) has enabled the use of Neural Networks for intrusion detection with high accuracy.

The selection of dimensions within the dataset is important.

In this paper we discuss various experiments performed on feature selection, normalization, Neural Networks (algorithms, weights, error functions, etc.), attack types, and statistical observations on the KDDCUP'99 dataset released by DARPA in the 1998 Intrusion Detection Evaluation Program, which lasted for nine weeks and during which raw TCP dump data was collected by simulating a US Air Force LAN environment.

Section I describes the KDDCUP'99 data set: its description, structure, attack categories, and issues/problems. Section II describes feature selection techniques and PCA, applies Naïve Bayes and Random Forest, and discusses the trade-off of reducing dimensionality to improve performance versus keeping high dimensions to improve prediction accuracy, and correlation with a high detection rate versus normalization of data for better performance. Section III describes the Neural Network topology: Model I (normal/abnormal), Model II (classification based on the 5 main categories), Model III (classification into 23 categories), and various experiments with different dimension selections and different NN training and error functions. Section IV describes the observations and results on the KDDCUP'99 dataset with statistical analysis and Neural Networks. Section V describes the conclusions, proposed KDDCUP'99 features, and future work.

Section I. Description of the KDDCUP'99 Data Set

A. Raw Data Set Processing

The raw dataset was a compressed 4-gigabyte binary TCP dump of 7 weeks of network traffic, from which 5 million connection records were processed. Similarly, 2 million connection records from 2 weeks of data were used as test data, downloaded from the reference [1] site.

B. KDDCUP'99 Structure

The KDDCUP'99 data set is divided into four major categories, with a slight variation in attack types between training and test data. The training data has 24 attack types, and the test data set has 14 additional attack types that are not in the training data; experts believe these new attack types share similar variants with the existing attack types, so the signatures of known attack types should enable the detection of new, novel attacks. KDDCUP'99 has 41 features in total, with 23 attack types divided into 4 main categories. Table I describes the 4 main attack categories, Table II the number of duplicate rows, Table III the number of records within the 4 major categories after removing duplicates, Table IV the 23 attack types, and Figure 1 the distribution of these 23 attack types across the 4 main categories.



TABLE I. KDDCUP'99 FOUR MAIN CATEGORIES

S.No  Abbreviation  Category Type                                                              Example
1     DOS           Denial-of-Service                                                          Syn flood
2     R2L           Remote to Local (unauthorized access from a remote machine)                Guessing password
3     U2R           User to Root (unauthorized access to local superuser (root) privileges)    Buffer overflow
4     Probing       Probing (scanning/determining weakness or vulnerability of a remote machine)  Port scanning

a. KDDCUP'99 Data Set [1].

FIGURE 1. CATEGORY DISTRIBUTION OF KDDCUP'99 DATA SET
(Pie chart: normal 87832 (60.33%), dos 54572 (37.48%), probe 2131 (1.46%), r2l 999 (0.69%), u2r 52 (0.04%).)

TABLE II. DUPLICATE RECORDS IN KDDCUP'99 DATA SET

Number of Records                            Count
Total no of original rows                    494020
Total no of rows after removing duplicates   145586
No of duplicate rows removed                 348434

TABLE IV. DESCRIPTION OF KDD 23 ATTACK TYPES

S.No  Label            Category  Total Records  After Removing Duplicates  No of Duplicates
1     back             dos       2203           968       1235
2     buffer_overflow  u2r       30             30        0
3     ftp_write        r2l       8              8         0
4     guess_passwd     r2l       53             53        0
5     imap             r2l       12             12        0
6     ipsweep          probe     1247           651       596
7     land             dos       21             19        2
8     loadmodule       u2r       9              9         0
9     multihop         r2l       7              7         0
10    neptune          dos       107201         51820     55381
11    nmap             probe     231            158       73
12    normal           normal    97277          87832     9445
13    perl             u2r       3              3         0
14    phf              r2l       4              4         0
15    pod              dos       264            206       58
16    portsweep        probe     1040           416       624
17    rootkit          u2r       10             10        0
18    satan            probe     1589           906       683
19    smurf            dos       280790         641       280149
20    spy              r2l       2              2         0
21    teardrop         dos       979            918       61
22    warezclient      r2l       1020           893       127
23    warezmaster      r2l       20             20        0
      Total                      494020         145586    348434
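The deduplication summarised in Table II is performed in the paper with R's unique() function; the stdlib Python sketch below is only an illustrative equivalent, run on toy rows standing in for KDD connection records.

```python
# Illustrative sketch of the duplicate-removal step (Table II).
# The paper uses R's unique(); this version deduplicates a toy list of
# connection records while preserving first occurrences.

def remove_duplicates(records):
    """Return unique records in first-seen order plus a count summary."""
    seen = set()
    unique_rows = []
    for row in records:
        key = tuple(row)          # rows must be hashable to deduplicate
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    summary = {
        "total": len(records),
        "unique": len(unique_rows),
        "duplicates_removed": len(records) - len(unique_rows),
    }
    return unique_rows, summary

# Toy data standing in for KDD connection records.
sample = [("tcp", "http", 181), ("udp", "dns", 20),
          ("tcp", "http", 181), ("tcp", "http", 181)]
rows, stats = remove_duplicates(sample)
```

On the full dataset this step is what reduces the 494020 original rows to 145586 unique rows.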
TABLE III. CATEGORY WISE DATA IN KDDCUP'99 DATA SET

S.No  Attack Category  No of Rows After Removing Duplicates
1     NORMAL           87832
2     U2R              52
3     R2L              999
4     PROBE            2131
5     DOS              54572

C. KDDCUP'99 Attack Types

The KDDCUP'99 dataset is classified into 23 attack types in the training dataset and 37 attack types in the test dataset. The additional 14 attack types in the test data have characteristics similar to the 23 training attack types, and our trained model should be able to identify them based on those 23 attack types. As seen in the KDDCUP'99 data structure and Tables I to IV, the distribution of data is skewed: the neptune, normal and smurf types contribute a large number of rows, while the least represented types (spy, phf, perl, multihop, loadmodule, ftp_write) have fewer than 10 records each in the entire dataset. For this reason, effective classification into all 23 attack types is difficult, as we will see in the experiments section; to train a Neural Network for effective classification we need an ample number of records for all attack types. In another experiment performed below, the 23 attack types are sub-classified into 5 main categories, as shown in Tables V to IX. The attack distributions in the KDDCUP'99 training and test datasets are different.

TABLE V. CATEGORY NORMAL

Normal (1 type)        No of Rows
1  normal    normal    87832
   Total               87832

TABLE VI. CATEGORY DOS

Denial of Service (6 types)  No of Rows
1  back      dos   968
2  land      dos   19
3  neptune   dos   51820
4  pod       dos   206
5  smurf     dos   641
6  teardrop  dos   918
   Total           54572

TABLE VII. CATEGORY PROBE

Probe (4 types)  No of Rows
1  ipsweep    probe  651
2  nmap       probe  158
3  portsweep  probe  416
4  satan      probe  906
   Total             2131

TABLE VIII. CATEGORY R2L

Remote to Local (8 types)  No of Rows
1  ftp_write     r2l  8
2  guess_passwd  r2l  53
3  imap          r2l  12
4  multihop      r2l  7
5  phf           r2l  4
6  spy           r2l  2
7  warezclient   r2l  893
8  warezmaster   r2l  20
   Total              999

TABLE IX. CATEGORY U2R

User to Root (4 types)  No of Rows
1  buffer_overflow  u2r  30
2  loadmodule       u2r  9
3  perl             u2r  3
4  rootkit          u2r  10
   Total                 52

D. ISSUES OF KDDCUP'99 DATA SET

As we can observe from Tables V to IX, the attack types are unevenly distributed across the dataset for three categories: User to Root (U2R), Remote to Local (R2L) and Probe. U2R has only 52 records (0.04%), R2L only 999 records (0.69%) and Probe only 2131 records (1.46%), whereas Denial of Service (DOS) has 54572 records (37.48%) and Normal has 87832 records (60.33%) of the total KDDCUP'99 training data set.

Given this skewed distribution, training a neural network for pattern classification to demonstrate misuse detection on KDDCUP'99 with 23 classes will not perform to the level expected, due to the small number of training records available for U2R, R2L and Probe. Even combined, U2R+R2L+Probe amount to only 2.19%, whereas DOS+Normal account for 97.81%. So it is not feasible to train a pattern classifier / neural network to demonstrate misuse detection at an acceptable level of performance on the KDD testing dataset for the three categories U2R, Probe and R2L. In the Model III experiments below (23 individual classes) we can observe this difference in detection accuracy between U2R+R2L+Probe on the one hand and DOS and Normal on the other; DOS attacks and Normal records perform better as they have a sufficient number of records for training the Neural Network, and the experiments below suggest some solutions to these problems. By contrast, Model I classifies smurf and normal into 2 classes (attack or normal), and Model II groups the 23 attack types into 5 main categories (Normal, U2R, R2L, Probe, DOS) as described in Tables V to IX; Model I and Model II achieve high accuracy, while Model III has some setbacks, which we examine in the experiments below.

Because of this skewed distribution, U2R+R2L+Probe form a minority group containing significantly fewer records than the majority group DOS+Normal, and as a result the KDDCUP'99 dataset suffers from a class imbalance problem.
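The class-imbalance percentages quoted in Section I.D follow directly from the per-category counts of Table III; this small stdlib Python sketch (the paper computes such figures in R) reproduces them.

```python
# Sketch of the class-imbalance figures of Section I.D, computed from
# the per-category record counts of Table III (after deduplication).
counts = {"normal": 87832, "dos": 54572, "probe": 2131, "r2l": 999, "u2r": 52}
total = sum(counts.values())                      # 145586 unique records

# Percentage share of each category, rounded to two decimals.
share = {k: round(100 * v / total, 2) for k, v in counts.items()}

# Minority vs majority group shares discussed in the text.
minority = round(100 * (counts["u2r"] + counts["r2l"] + counts["probe"]) / total, 2)
majority = round(100 * (counts["dos"] + counts["normal"]) / total, 2)
```

This is the arithmetic behind the 2.19% (U2R+R2L+Probe) versus 97.81% (DOS+Normal) split.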
Section II. Feature Selection and Data Processing on the KDDCUP'99 Dataset

The process flow applied to the KDDCUP'99 dataset is displayed in Figure 3. The processing and analysis were done with MATLAB 2019, the R language and the Weka tool on an Intel i7 processor with 16 GB RAM. The data set underwent the process flow described in Figure 3.

E. Data Set Pre-Processing

The KDD dataset downloaded from [1] is in raw format; on observation, the data file is comma-separated (csv). The header is in a separate file, names.txt, in row (vertical) format; we transposed it to fit the KDD 10% corrected testing data file as a header, and as the last column label is blank in names.txt we labelled the 42nd column as "label". On careful observation of the dataset we could assess, from Tables I to IX, that it contains a huge number of duplicate records, which need to be removed before training our Neural Network model, else overfitting will occur. Duplicate rows were removed using the unique function in R and also with the Excel file option.

The next step is to identify the nominal columns in the collected data. Columns 2, 3, 4 and 42 were identified as nominal (2-protocol_type, 3-service, 4-flag, 42-label, which is our target variable). To train any machine learning model we need all data in numeric form, so we converted the nominal columns to numeric data to change this raw data into a machine-understandable form.

In the KDD data set we found 42 features and 494020 records in total; after removing duplicates we have 145586 rows. In the target variable (label) we identified 23 attack types, which we classified into the 4 main attack categories (DOS, R2L, U2R, Probe) plus Normal.

FIGURE 3. KDDCUP'99 DATA SET PROCESS FLOW DIAGRAM

F. Data Set Filtering

After pre-processing the KDD dataset, we need to filter the data. Using the R programming language, we performed filtering and data analytics for KDDCUP'99 data visualization. First we labelled the columns of the dataset, then checked whether any NA values were present in the KDD data set; the output was 0, meaning no column contains NA or blank values. If there were blank cells, then, based on the skewness and variation of the data, we could use the mean, median or mode to fill the missing values; as the KDD data set has no missing data, we did not apply any of these techniques. The next step was to visualize the KDD label distribution: after executing barplot code in R, the distribution of data can be seen in Figure 4.

Filtering the data has several stages, one of which is finding zero variance within the columns (features). Applying a zero-variance filter identified columns (1 6 7 8 9 10 11 13 14 15 16 17 18 19 20 21 22 29 31 32 33 34 37), which can be removed from the dataset to reduce its dimensionality. After removing these columns, only 19 columns remain: (2 3 4 5 12 23 24 25 26 27 28 30 35 36 38 39 40 41 42).

For example, a near-zero-variance predictor is one that, for 1000 samples, has two distinct values, 999 of which are a single value [5].

FIGURE 4. KDDCUP'99 DATA SET ATTACKS DISTRIBUTION
Generated using R Studio.
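The near-zero-variance check described above is done in the paper with R's caret package [5]; the sketch below is a stdlib Python approximation of the same idea. The cutoff values (frequency ratio 19, i.e. 95/5, and 10% unique values) mirror caret's documented defaults and are assumptions here, as are the toy columns.

```python
# Illustrative near-zero-variance filter, mirroring the caret
# nearZeroVar step described above ([5]); cutoffs follow caret defaults.

def near_zero_variance(column, freq_ratio_cutoff=19.0, unique_cutoff=0.10):
    """Flag a column whose dominant value overwhelms the rest."""
    freq = {}
    for v in column:
        freq[v] = freq.get(v, 0) + 1
    counts = sorted(freq.values(), reverse=True)
    if len(counts) == 1:                 # zero variance: one distinct value
        return True
    freq_ratio = counts[0] / counts[1]   # most common vs second most common
    pct_unique = len(counts) / len(column)
    return freq_ratio > freq_ratio_cutoff and pct_unique < unique_cutoff

constant_col = [0] * 1000                # zero variance
skewed_col = [0] * 999 + [1]             # the 999-of-1000 example from [5]
varied_col = list(range(1000))           # plenty of variance
```

The 999-of-1000 column from the example in [5] is flagged, while a column with many distinct values is kept.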

G. KDD Data Normalization

In normalization, all record values are scaled to between 0 and 1. Normalization is done here because the different features in the dataset do not have the same range of values; as a result, gradient descent may take longer and oscillate back and forth before finding a global or local minimum. To overcome this learning problem in the NN model, it is important to normalize the data so that the different features take the same range of values and gradient descent can converge quickly. Model accuracy improves with normalized data.
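The rescaling step above can be sketched as follows. The paper performs it in MATLAB (mapminmax); this stdlib Python version is only an illustrative equivalent that maps each feature column to [0, 1], with toy data.

```python
# Minimal min-max normalization sketch: every feature column is rescaled
# to [0, 1] so that gradient descent converges faster, as described above.

def min_max_normalize(rows):
    """Rescale each column of a list-of-lists dataset to the [0, 1] range."""
    cols = list(zip(*rows))                          # transpose to columns
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        if hi == lo:                                 # constant column: map to 0
            scaled_cols.append([0.0] * len(col))
        else:
            scaled_cols.append([(v - lo) / (hi - lo) for v in col])
    return [list(r) for r in zip(*scaled_cols)]      # transpose back to rows

data = [[0, 100, 5], [5, 200, 5], [10, 400, 5]]      # toy feature rows
norm = min_max_normalize(data)
```

After this step, features such as byte counts and connection counts share the same [0, 1] scale regardless of their original magnitudes.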

Once the data is cleansed — by removing duplicate rows, removing zero-variance columns, checking for blank cells and NA columns, replacing nominal columns with numeric data, and then normalizing — we can load it into MATLAB for processing, transform it there, and simultaneously load the target class data.

By normalizing the data we can achieve high performance with high dimensions when many classes must be distinguished, as in KDD with its 23 classes; it is important to keep the features that build a better model, as low dimensionality suffers in prediction accuracy, whereas high dimensionality with important features gives better prediction accuracy.

In the experiments below we have classified the data into three models. In Model I we classify the data into 2 classes (normal/anomaly); in Model II the 23 classes are grouped into 5 main categories (Normal, U2R, R2L, Probe, DOS); and in Model III we classify the data into 23 classes based on the labels in column 42.

The full Kdd.data test dataset has 4898430 records with 42 features, and the training data has 494020 records with 42 features.

H. Naïve Bayes

We built a Naïve Bayes model, after applying the near-zero-variance filter, in R using the e1071 package, on 19 features (2 3 4 5 12 23 24 25 26 27 28 30 35 36 38 39 40 41 42).

The accuracy of the Naïve Bayes output is 64% on the above selected feature set.

I. Random Forest

The random forest algorithm was performed on the 25 features listed below:

(srv_rerror_rate, rerror_rate, flag, dst_host_rerror_rate, logged_in, dst_bytes, src_bytes, num_compromised, dst_host_srv_count, duration, dst_host_same_src_port_rate, dst_host_diff_srv_rate, dst_host_count, dst_host_srv_serror_rate, count, hot, dst_host_same_srv_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, serror_rate, srv_serror_rate, diff_srv_rate, srv_count, srv_diff_host_rate, protocol_type, result)

TABLE X. RANDOM FOREST PREDICTION

Pred     dos     normal  probe  r2l  u2r
dos      352278  16      8      2    0
normal   23      87507   31     85   43
probe    11      14      3657   2    0
r2l      0       13      0      923  0
u2r      0       0       0      1    3

TABLE XI. RANDOM FOREST RESULTS (%)

         dos    normal  probe  r2l    u2r
dos      99.99  0       0      0      0
normal   0.03   99.79   0.04   0.1    0.05
probe    0.3    0.38    99.27  0.05   0
r2l      0      1.39    0      98.61  0
u2r      0      0       0      25     75

Result: the efficiency of Random Forest is almost 99%.

J. Feature Selection, Data Cleaning and Dimensionality Selection

After applying various filters — duplicate record removal, unique record identification, zero-variance columns (e.g. columns 20 and 21), dataset cleaning, feature selection with Random Forest, Principal Component Analysis, Weka attribute selection, and R library feature selection by variance — we obtained the feature sets listed in Table XII, which we will use and test in Model I, Model II and Model III of the Neural Network.

TABLE XII. FEATURE SELECTION AFTER DIMENSIONALITY REDUCTION

Code     Column Numbers                                                        Dimensions
PC12     2,13,22,24,25,28,30,31,32,33,37,38                                    12
ALLD     2,3,5,6,23,24,39                                                      7
HDR1     3,5,6,39                                                              4
HDR2     3,4,5,6,14,16,27,28,37,39                                             10
IGR      2,3,5,6,23,24,33,34,35,36                                             10
ME       2,3,4,5,6,12,14,16,21,23,24,27,28,29,30,31,32,33,34,35,36,37,39       23
PCA      4,21,23,24,27,31,32,36,37,40                                          10
WEKA     2,3,4,5,6,7,8,14,23,30,36                                             11
WEKA_RF  27,28,4,24,21,23,37,32,13,5,3,31,8,36,6                               15
R_RF     24,25,4,38,12,6,5,14,31,1,34,33,30,37,21,10,32,35,36,23,28,22,29,2,41  25

Table XII displays the various feature sets, from low to high dimensionality, extracted using PCA, Weka, feature-extraction algorithms, etc.: (HDR1, HDR2, IGR, WEKA, WEKA_RF) from Weka; (PC12, PCA, R_RF) from R; (ALLD, ME) from manual selection by data analysis using R.

Section III. Neural Network Topology for the KDDCUP'99 Dataset

In building Models I to III, MATLAB was used to build the Neural Network, perform data transformations and normalization, and plot the confusion matrix, ROC, etc. The WEKA tool was used for feature selection and for the Naïve Bayes and random forest algorithms. The R programming language was used for principal component analysis, data analytics, visualization, checking which features influence the 5 main categories, variance filtration, etc.; MS Excel was used for basic data manipulations.
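The percentage results of Table XI follow from the raw counts of Table X by normalizing each row by its total; this small stdlib Python sketch (the paper derives these figures in R/Weka) reproduces that step, with the counts copied from Table X.

```python
# Sketch linking Table X to Table XI: each row of the raw Random Forest
# confusion matrix (counts from Table X) is divided by its row total to
# give per-class percentages.
labels = ["dos", "normal", "probe", "r2l", "u2r"]
confusion = [
    [352278, 16, 8, 2, 0],       # predicted dos
    [23, 87507, 31, 85, 43],     # predicted normal
    [11, 14, 3657, 2, 0],        # predicted probe
    [0, 13, 0, 923, 0],          # predicted r2l
    [0, 0, 0, 1, 3],             # predicted u2r
]

percent = [[round(100 * v / sum(row), 2) for v in row] for row in confusion]
```

The tiny u2r row (4 predictions in total) is why its percentages swing to 25/75, illustrating how the minority classes distort rate-based metrics.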
A. Neural Network Model I (Classification as either attack or normal)

a) In Model I we segregated the dataset by taking 50% smurf attack records and 50% normal records and clubbing them together for classification and testing. Here 641 records of the normal category and 641 records of the smurf category were combined to create a sample dataset of 1282 records for testing. Model I dataset details are given below.

TABLE XIII. MODEL I DATASET DETAILS

Data / Records      smurf    normal
Total records       280790   97278
Unique records      641      87832
Duplicate records   280149   9446
Model I             641      641

b) After applying the pre-processing techniques discussed in the sections above, the data is automatically loaded by a single script into MATLAB, which loads and transforms the dataset and divides it into training, validation and testing sets. This script has options to modify the network training functions or error functions, and it displays the Receiver Operating Characteristics (ROC), confusion matrix, error histogram, training state (gradient and validation checks) and performance (best validation performance after a certain number of epochs).

c) Neural Net Topology

To train and build the neural network, the target and sample datasets were supplied to a neural net with a hidden layer size of 10, and the scaled conjugate gradient training function was used. The input/output pre/post-processing functions removeconstantrows and mapminmax were used. For division of the data, the dividerand function was used to divide the data randomly, with the training ratio set to 70%, the validation set to 15% and the test set to 15%. The cross-entropy function was used as the performance metric, and the plotting functions plotperform, plottrainstate, ploterrhist, plotconfusion and plotroc were used to plot the results/output of the Neural Network.

d) After running Model I we achieved a 100% detection rate classifying normal versus attack, as shown in the Figure 5 confusion matrix: the two output classes are correctly classified against the two target classes.

e) In Figure 5 the target and output classes are correctly classified with 100% true positives. In the Figure 6 ROC plot we can observe class 1 and class 2 plotted on the Y axis from 0 to 1, vertically straight, without any divergence or false positives.

FIGURE 5. CONFUSION MATRIX OF MODEL I

FIGURE 6. MODEL I RECEIVER OPERATING CHARACTERISTICS (ROC)

f) Model I Performance

Percentage Correct Classification: 100.000%
Percentage Incorrect Classification: 0.000%

B. Neural Network Model II (Classification of 23 attack types grouped into 5 main categories)

a) In Model II the 23 KDDCUP'99 attack types were categorised into 5 main groups based on the segregation in Tables V to IX. The 5 main categories are Normal, R2L, U2R, Probe and DOS. Normal has only 1 sub-type, R2L has 8 sub-types, U2R has 4 sub-types, Probe has 4 sub-types and DOS has 6 sub-types, as listed in Tables V to IX.

b) The pre-processed data set, after the data cleaning methods described in Section II, gave us a list of 10 coded feature selection options: some with high dimensionality reduction (fewer features), others extracted by principal component analysis, variance selection or the Weka attribute selection tool. We use this pre-processed data, normalised and converted from nominal to numeric, to train and test the Neural Network, and we test all the feature sets listed in Table XII.

c) With MATLAB code, both sample and target data are automatically loaded, transformed and processed to train neural networks with different settings.
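The random 70/15/15 division performed by MATLAB's dividerand (used for Model I above) can be sketched as follows; this stdlib Python version is only an illustrative equivalent, applied to the 1282-record Model I sample.

```python
# Sketch of the dividerand-style 70/15/15 split used for Model I.
# MATLAB's dividerand assigns samples to train/validation/test at random;
# this is an illustrative stdlib equivalent over sample indices.
import random

def divide_rand(n_samples, train=0.70, val=0.15, seed=0):
    """Randomly partition sample indices into train/validation/test sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)         # fixed seed for reproducibility
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# Model I sample dataset: 641 smurf + 641 normal = 1282 records.
tr, va, te = divide_rand(1282)
```

Each index lands in exactly one partition, so training, validation and testing sets never overlap.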

d) Neural Network Topology

In Model II the output and target are divided into 5 classes, and the dataset was supplied to the neural network with options: a) 1 hidden layer with 10 neurons; b) later modified to 1 hidden layer with 30 neurons; and c) 2 hidden layers, the 1st with 30 neurons and the 2nd with 5 neurons. Options b) and c) gave optimal results. The training function used was scaled conjugate gradient backpropagation (SCG), which updates weights and bias values according to the scaled conjugate gradient method. The cross-entropy function was used as the performance function, and the data was divided into 70% training, 15% validation and 15% testing. All parameters can be modified to change the functions or settings of the neural network.

Bayesian regularization backpropagation was also tested as a training function. It updates the weights and bias values according to Levenberg-Marquardt optimization, minimizing a combination of squared errors and weights and then determining the correct combination so as to produce a network that generalizes well; this process is called Bayesian regularization [7]. This function was run on an Intel i7 with 2 cores / 4 logical processors and 16 GB RAM.

The trainbr function takes longer, but it is better for challenging problems and its results were optimal.

FIGURE 6. CONFUSION MATRIX OF MODEL II, SCALED CONJUGATE GRADIENT BACKPROPAGATION (SCG): 100%

FIGURE 8. ROC PLOT OF MODEL II, BAYESIAN REGULARIZATION BACKPROPAGATION (BRB)

TABLE XIV. MODEL II DATASET DETAILS

Attack Type  No of Records  Percent
normal       87832          60.330%
u2r          52             0.036%
r2l          999            0.686%
probe        2131           1.464%
dos          54572          37.484%

e) In Figure 6 above, the scaled conjugate gradient backpropagation function was used to train the neural network and achieved optimal performance in less time with labelled (supervised) datasets. In Figures 7 and 8, the Bayesian regularization backpropagation function was used to train the neural network without the labelled dataset, which took longer; the categories Normal, R2L, Probe and DOS all achieved optimal performance (97% to 99%), but U2R, which suffers from having fewer records, was misclassified and achieved only 40.8%. Even though the neural network was supplied with an unlabelled dataset, it performed well in terms of classification.

f) In Figure 6 we can observe that the 23 attack types, grouped into the 5 main categories, were classified with an accuracy of 100%.

g) Model II Performance

Percentage Correct Classification: 100.000%
Percentage Incorrect Classification: 0.000%
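The Model II grouping of the 23 training labels into the 5 main categories of Tables V to IX can be expressed as a simple lookup; this stdlib Python sketch is an illustrative equivalent of the relabelling the paper performs in MATLAB/R.

```python
# Sketch of the Model II label grouping: the 23 KDDCUP'99 training labels
# are mapped to the 5 main categories of Tables V-IX.
CATEGORY = {
    "normal": "normal",
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    "ipsweep": "probe", "nmap": "probe", "portsweep": "probe", "satan": "probe",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l", "multihop": "r2l",
    "phf": "r2l", "spy": "r2l", "warezclient": "r2l", "warezmaster": "r2l",
    "buffer_overflow": "u2r", "loadmodule": "u2r", "perl": "u2r", "rootkit": "u2r",
}

def to_category(label):
    """Map a raw KDD label (e.g. 'smurf.') to its Model II category."""
    return CATEGORY[label.rstrip(".")]   # raw labels carry a trailing dot
```

The mapping covers exactly 23 labels: 1 normal, 6 dos, 4 probe, 8 r2l and 4 u2r, matching Tables V to IX.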

FIGURE 7. CONFUSION MATRIX OF MODEL II, BAYESIAN REGULARIZATION BACKPROPAGATION (BRB)

C. Neural Network Model III (Classification of 23 attack types)

a) In Model III we tested two data sets: the original and a synthetic (duplicated) dataset.

1. Model III experiment 1: the original dataset without duplicate records, which suffers from the class imbalance problem. We address this problem and apply one of its solutions in Model III experiment 2.

2. Model III experiment 2: the original dataset with an equal number of instances in all 23 categories, each category with 100 records. The number of instances in the 23 categories originally ranged from 2 to over 6000, and a few categories had only 2 records; we created duplicate records, similar to synthetic data generation, replicating the originals up to 100 records, so that all 23 categories have 100 records each. Here we applied one of the solutions to the class imbalance problem to test our model's performance.

b) Neural Network Topology

In this model there are 2 hidden layers, the first with 30 neurons and the second with 23 neurons. The training function used is scaled conjugate gradient backpropagation with cross-entropy as the performance function, and the confusion matrix, ROC, error histogram and validation plots were plotted.

FIGURE 9. CONFUSION MATRIX, MODEL III EXPERIMENT 1

FIGURE 11. ROC, MODEL III EXPERIMENT 2
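The balancing step of Model III experiment 2 (replicating each class's records until every class has exactly 100) can be sketched as follows. This stdlib Python version is only an illustrative equivalent; the toy class sizes mirror the 2-to-6000+ range described above, and truncating large classes to 100 is an assumption implied by "all 23 categories have 100 records each".

```python
# Sketch of the Model III experiment 2 balancing step: every class is
# brought to exactly `target` records by cycling through its existing rows
# (oversampling small classes, truncating large ones).
import itertools

def balance_to(records_by_class, target=100):
    """Duplicate each class's rows cyclically until it has `target` records."""
    balanced = {}
    for label, rows in records_by_class.items():
        balanced[label] = list(itertools.islice(itertools.cycle(rows), target))
    return balanced

toy = {"spy": [("row_a",), ("row_b",)],               # only 2 records originally
       "neptune": [("row_%d" % i,) for i in range(6000)]}
result = balance_to(toy)
```

A 2-record class like spy ends up as 50 copies of each of its two rows, which is exactly the kind of synthetic duplication the experiment relies on.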

Section IV. Observations and Statistical Analysis of the KDDCUP'99 Dataset

In Figure 9, as a few of the classes fall under the minority group and have only 2 to 10 rows per class, only those categories were misclassified, and the model could not achieve optimal performance.

As the KDDCUP'99 dataset is skewed, there is a data distribution problem: U2R+R2L+Probe form a minority group containing significantly fewer records than DOS+Normal, the majority group; because of this differing distribution, the KDDCUP'99 dataset suffers from a class imbalance problem.

Justin M. Johnson and Taghi M. Khoshgoftaar [6] have surveyed deep learning with class imbalance. In their 2019 work on machine learning techniques for class-imbalanced data, they note that addressing class imbalance with traditional machine learning techniques has been studied extensively over the last two decades [6]. The bias towards the majority class can be alleviated by altering the training data to decrease imbalance, or by modifying the model's underlying learning or decision process to increase sensitivity towards the minority group [6]. As such, methods for handling class imbalance are grouped into data-level techniques, algorithm-level methods, and hybrid approaches [6].

In the principal component analysis of Figure 12, the red lines represent columns and the black clusters represent the rows. Red arrows pointing in the same direction and close to each other have more correlation among them; the plotted graph gives a clear visualization of PC1 and PC2.

FIGURE 12. PRINCIPAL COMPONENT ANALYSIS

FIGURE 10. CONFUSION MATRIX, MODEL III EXPERIMENT 2

In Figure 10, for Model III experiment 2, all 23 classes were correctly classified and the model performed well with 99.43% accuracy.

c) Model III Performance

Percentage Correct Classification: 99.434783%
Percentage Incorrect Classification: 0.565217%
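The biplot reading above ("arrows pointing the same way are more correlated") can be sanity-checked numerically. The paper does its PCA and correlation work in R; this stdlib Python sketch computes the Pearson correlation between toy feature columns (the column names and values are illustrative assumptions, not KDD data).

```python
# Numeric companion to the Figure 12 biplot interpretation: features whose
# arrows point the same way have Pearson correlation near +1, opposite
# directions near -1.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

src_bytes = [100, 200, 300, 400]       # toy feature columns, not KDD values
dst_bytes = [110, 190, 310, 390]       # moves with src_bytes -> r near +1
count = [9, 7, 4, 2]                   # moves against them -> r near -1
```

Highly correlated pairs like these are exactly the redundancy that PCA exploits when reducing dimensionality.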
Section V. Conclusion and future work.

In this paper three models were built with different parameters on neural networks, and their performance metrics were captured. Experiments based on the KDDCUP'99 dataset gave encouraging results.

All three models were tested with different training functions: trainscg (scaled conjugate gradient backpropagation), trainbr (Bayesian regularization backpropagation), trainlm (Levenberg-Marquardt backpropagation), and newrb (radial basis function network), and with different epoch settings. The software is designed so that all parameters can be changed, such as the number of layers, the number of neurons, and the error functions. Model I gives 100% accuracy, as its datasets are balanced and the task is binary classification. For Model II, the first experiment (Figures 7 and 8), run as unsupervised learning using Bayesian regularization backpropagation, showed optimal results, and the second experiment, using scaled conjugate gradient backpropagation, gave 100% accuracy. For Model III, Experiment 1 tested the original dataset without duplicate records; owing to the class imbalance problem the minority classes suffered misclassification while the majority class was classified with 99% accuracy. In Experiment 2 synthetic data was added manually to balance the datasets, which gave optimal results with 100% accuracy in all classes.

There is considerable scope for further work: evaluating the dataset with the aid of Hadoop and Scala for better performance when dealing with huge datasets, applying deep learning methods to the KDDCUP'99 dataset, and applying class imbalance techniques to improve the classification accuracy of the minority classes.

Another dataset, NSL-KDD, is also available; as stated on its website, its main purpose is to provide a stable, equally distributed dataset for research.

This work can be further extended: since the KDD dataset suffers from the class imbalance problem, synthetic data can be inserted.

Based on the above observations we propose the feature selection for the KDD dataset given in Table XII. Reducing too many dimensions lowers the accuracy of attack detection, whereas an optimal feature subset detects all the attack types.

The KDD dataset also tends to produce misleadingly high accuracy results of around 99%, and this is a major problem with the dataset: a model can predict the majority class for every instance and still achieve high classification accuracy on a dataset with a large class imbalance, which is known as the accuracy paradox [10]. Because the majority class drives the high accuracy rate, precision and recall can help in such cases, although precision can also be biased by very unbalanced classes.

REFERENCES
[1] The KDD Cup 1999 dataset. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[2] E. Frank, M. A. Hall, and I. H. Witten, The WEKA Workbench. Online appendix for "Data Mining: Practical Machine Learning Tools and Techniques," 4th ed. Morgan Kaufmann, 2016.
[3] MATLAB and Statistics Toolbox Release 2014a, The MathWorks, Inc., Natick, Massachusetts, United States.
[4] R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. [Online]. Available: http://www.R-project.org
[5] nearZeroVar, caret package. [Online]. Available: https://www.rdocumentation.org/packages/caret/versions/6.0-82/topics/nearZeroVar
[6] J. M. Johnson and T. M. Khoshgoftaar, "Survey on deep learning with class imbalance," Journal of Big Data, vol. 6, no. 27, 2019. [Online]. Available: https://doi.org/10.1186/s40537-019-0192-5
[7] MATLAB software documentation. [Online].
[8] R language software documentation. [Online].
[9] Weka software tool documentation. [Online].
[10] Accuracy paradox. [Online]. Available: https://en.wikipedia.org/wiki/Accuracy_paradox

Appendix I.

The training dataset (kddcup_10%_labeled) has 23 attack types and the test dataset (corrected) has 38 attack types, with 21 attack types duplicated in both datasets and 40 unique attack types overall across training and testing. Two attack types, spy and warezclient, appear only in the training dataset, and 17 attack types appear only in the testing dataset. In total there are 21 duplicated attack types, 2 unique to training, and 17 unique to testing, which comes to 21 + 17 + 2 = 40 attack types.

TABLE XV. KDD TRAIN AND TEST DATASET ATTACK TYPES

S.No  Train Dataset     Test Dataset       Unique Records (Train+Test)
1     normal            normal.            normal
2     back              snmpgetattack.     back
3     buffer_overflow   named.             buffer_overflow
4     ftp_write         xlock.             ftp_write
5     guess_passwd      smurf.             guess_passwd
6     imap              ipsweep.           imap
7     ipsweep           multihop.          ipsweep
8     land              xsnoop.            land
9     loadmodule        sendmail.          loadmodule
10    multihop          guess_passwd.      multihop
11    neptune           saint.             neptune
12    nmap              buffer_overflow.   nmap
13    perl              portsweep.         perl
14    phf               pod.               phf
15    pod               apache2.           pod
16    portsweep         phf.               portsweep
17    rootkit           udpstorm.          rootkit
18    satan             warezmaster.       satan
19    smurf             perl.              smurf
20    spy               satan.             spy
21    teardrop          xterm.             teardrop
22    warezclient       mscan.             warezclient
23    warezmaster       processtable.      warezmaster
24                      ps.                snmpgetattack
25                      nmap.              named
26                      rootkit.           xlock
27                      neptune.           xsnoop
28                      loadmodule.        sendmail
29                      imap.              saint
30                      back.              apache2
31                      httptunnel.        udpstorm
32                      worm.              xterm
33                      mailbomb.          mscan
34                      ftp_write.         processtable
35                      teardrop.          ps
36                      land.              httptunnel
37                      sqlattack.         worm
38                      snmpguess.         mailbomb
39                                         sqlattack
40                                         snmpguess

Appendix II.

Data Analytics on the KDD dataset using R

Exploratory Analysis

Observation 1: From the exploratory plot, dst_host_same_src_port_rate has a slight effect on the intrusion type; for dst_host_same_src_port_rate values greater than or equal to 1, the attack type can be probe or r2l.

Observation 2: flag is a strong predictor; for flag = REJ or S0 the attack type is dos.

Observation 3: For duration greater than 30000 the attack type is probe, so duration is a strong predictor.

Appendix III. Evaluators and algorithms run in WEKA

1. kMeans

Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: X_Sample
Instances: 1282
Attributes: 42
Time taken to build model (full training data): 0.08 seconds
=== Model and evaluation on training set ===
Clustered Instances
0    641 (50%)
1    641 (50%)

2. attributeSelection.CorrelationAttributeEval

Evaluator: weka.attributeSelection.CorrelationAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: X_Sample
Instances: 1282
Attributes: 42
Search Method: Attribute ranking.
Attribute Evaluator (supervised, Class (numeric): 42 label): Correlation Ranking Filter
Ranked attributes:
0.991    2 protocol_type
0.9876  12 logged_in
0.885    3 service
0.576   37 dst_host_srv_diff_host_rate
0.3157  31 srv_diff_host_rate
0.3152   6 dst_bytes
0.2767  34 dst_host_same_srv_rate
0.1951  39 dst_host_srv_serror_rate
0.1789  33 dst_host_srv_count
0.0477  30 diff_srv_rate
0.0425  26 srv_serror_rate
0.0397  25 serror_rate
0.0395  10 hot
0.0394   1 duration

3. attributeSelection.ClassifierAttributeEval

Evaluator: weka.attributeSelection.ClassifierAttributeEval -execution-slots 1 -B weka.classifiers.rules.ZeroR -F 5 -T 0.01 -R 1 -E DEFAULT
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: X_Sample
Instances: 1282
Attributes: 42
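WEKA's CorrelationAttributeEval, used in item 2 above, ranks each attribute by the magnitude of its Pearson correlation with the class. A minimal sketch of the same ranking idea in plain Python (the toy columns and values below are illustrative, not the actual KDD attributes or scores):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(columns, label):
    """Rank attributes by |correlation with the class label|, descending."""
    scores = {name: abs(pearson(col, label)) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: 'logged_in' tracks the label closely, 'duration' barely.
columns = {
    "logged_in": [0, 0, 1, 1, 1, 0, 1, 1],
    "duration":  [5, 3, 4, 6, 2, 5, 3, 4],
}
label = [0, 0, 1, 1, 1, 0, 1, 0]
for name, score in rank_attributes(columns, label):
    print(f"{score:.4f}  {name}")
```

As in the WEKA output, strongly correlated attributes float to the top of the ranking, which is what makes this a cheap first-pass feature selection filter.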
