
Intrusion Detection Using Artificial Intelligence on KDD Data Set

With the Aid of Neural Network and Statistical Analysis

Raghvendra Giri
Computer Engineering and Media
De Montfort University
Leicester, United Kingdom
[email protected]

Abstract—Developing an efficient Intrusion Detection System (IDS) has attracted many researchers seeking to detect novel network attacks. Data is increasing exponentially, and with new stringent global policies on protecting and sharing data in place, it is vital to protect organizations and small-scale businesses such as hospitals, power grids and healthcare grids. Here we use the KDDCUP'99 data set for the evaluation of intrusion detection, examine the problems of the KDDCUP'99 dataset, and propose the development of a novel, efficient IDS with the aid of Artificial Intelligence.

Keywords—Neural Network, KDDCUP'99 data set, Intrusion Detection (IDS), Artificial Intelligence.

Software Tools Used – 1) MATLAB: code for normalization, processing/loading/transformation of data, and designing the Neural Network topology. 2) R: statistical analysis, data-trend observation, Principal Component Analysis, Random Forest, data analytics and visualization. 3) WEKA: classification and attribute selection based on average impurity decrease, Random Forest, PCA.

I. INTRODUCTION

With the exponential growth of technological advances, the wider penetration of internet-enabled mobile phones, and access to information sharing on social media, data is growing exponentially: it was estimated that the amount of data collected from the dawn of the century until 2003 is now generated every year, and the volume continues to grow rapidly. Many businesses growing globally rely on technology to share data over established networks, and as a business grows it becomes important to protect its data, especially sensitive data that might be leaked or captured by competitors, or healthcare data, which is highly sensitive; many policies have evolved to protect data sharing. To protect this data, the development of an efficient intrusion detection system is vital. Earlier, with the aid of statistical methods, data was visualised to predict intrusion trends, but the development of efficient high-performance hardware (GPUs, multiprocessors, quantum computing) has enabled the use of Neural Networks for intrusion detection with high accuracy.

The selection of dimensions within the dataset is important.

In this paper we discuss various experiments performed on feature selection, normalization, Neural Networks (algorithms, weights, error functions, etc.), attack types, and statistical observations on the KDDCUP'99 dataset released by DARPA in the 1998 Intrusion Detection Evaluation Program, which lasted for nine weeks and during which raw TCP dump data was collected by simulating a US Air Force LAN environment.

Section I describes the KDDCUP'99 data set: its description, structure, attack categories, and issues/problems. Section II describes feature selection techniques and PCA, applies Naïve Bayes and Random Forest, and discusses the trade-off of reducing dimensionality to improve performance versus keeping high dimensions to improve prediction accuracy, and correlation with a high detection rate versus normalization of data for better performance. Section III describes the Neural Network topology: Model I (normal/abnormal), Model II (classification based on the 5 main categories), Model III (classification into 23 categories), and various experiments with different dimension selections and different NN training and error functions. Section IV describes the observations and results on the KDDCUP'99 dataset with statistical analysis and Neural Networks. Section V describes the conclusions, proposed KDDCUP'99 features, and future work.

Section I. Description of the KDDCUP'99 Data Set

A. Raw Data Set Processing

The raw dataset was a compressed 4-gigabyte binary TCP dump of 7 weeks of network traffic, from which 5 million connection records were processed. Similarly, 2 million connection records from 2 weeks of data were used as test data, downloaded from the reference [1] site.

B. KDDCUP'99 Structure

The KDDCUP'99 data set is divided into four major categories, with a slight variation in attack types between training and test data. The training data has 24 attack types, and the test data set has 14 additional attack types that are not in the training data; experts believe these new attack types share similar variants with the existing attack types, so the signatures of known attack types should enable the detection of new, novel attacks. KDDCUP'99 has 41 features in total, with 23 attack types divided into 4 main categories. Table I describes the 4 main attack categories, Table II the number of duplicate rows, Table III the number of records within the 4 major categories after removing duplicates, Table IV the 23 attack types, and Figure 1 the distribution of these 23 attack types across the 4 main categories.



TABLE I. KDDCUP'99 FOUR MAIN CATEGORIES

S.No  Abbreviation  Category Type                                                              Example
1     DOS           Denial-of-Service                                                          Syn flood
2     R2L           Remote to Local (unauthorized access from a remote machine)                Guessing password
3     U2R           User to Root (unauthorized access to local superuser (root) privileges)    Buffer overflow
4     Probing       Probing (scanning/determining weakness or vulnerability of a remote machine)  Port scanning

a. KDDCUP'99 Data Set [1].

FIGURE 1. CATEGORY DISTRIBUTION OF KDDCUP'99 DATA SET
(Pie chart: normal 87832 (60.33%), dos 54572 (37.48%), probe 2131 (1.46%), r2l 999 (0.69%), u2r 52 (0.04%).)

TABLE II. DUPLICATE RECORDS IN KDDCUP'99 DATA SET

Number of Records                            Count
Total no of original rows                    494020
Total no of rows after removing duplicates   145586
No of duplicate rows removed                 348434

TABLE IV. DESCRIPTION OF KDD 23 ATTACK TYPES

S.No  Label            Category  Total Records  After Removing Duplicates  No of Duplicates
1     back             dos       2203           968       1235
2     buffer_overflow  u2r       30             30        0
3     ftp_write        r2l       8              8         0
4     guess_passwd     r2l       53             53        0
5     imap             r2l       12             12        0
6     ipsweep          probe     1247           651       596
7     land             dos       21             19        2
8     loadmodule       u2r       9              9         0
9     multihop         r2l       7              7         0
10    neptune          dos       107201         51820     55381
11    nmap             probe     231            158       73
12    normal           normal    97277          87832     9445
13    perl             u2r       3              3         0
14    phf              r2l       4              4         0
15    pod              dos       264            206       58
16    portsweep        probe     1040           416       624
17    rootkit          u2r       10             10        0
18    satan            probe     1589           906       683
19    smurf            dos       280790         641       280149
20    spy              r2l       2              2         0
21    teardrop         dos       979            918       61
22    warezclient      r2l       1020           893       127
23    warezmaster      r2l       20             20        0
      Total                      494020         145586    348434
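The deduplication summarised in Table II is performed in the paper with R's unique() function; the stdlib Python sketch below is only an illustrative equivalent, run on toy rows standing in for KDD connection records.

```python
# Illustrative sketch of the duplicate-removal step (Table II).
# The paper uses R's unique(); this version deduplicates a toy list of
# connection records while preserving first occurrences.

def remove_duplicates(records):
    """Return unique records in first-seen order plus a count summary."""
    seen = set()
    unique_rows = []
    for row in records:
        key = tuple(row)          # rows must be hashable to deduplicate
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    summary = {
        "total": len(records),
        "unique": len(unique_rows),
        "duplicates_removed": len(records) - len(unique_rows),
    }
    return unique_rows, summary

# Toy data standing in for KDD connection records.
sample = [("tcp", "http", 181), ("udp", "dns", 20),
          ("tcp", "http", 181), ("tcp", "http", 181)]
rows, stats = remove_duplicates(sample)
```

On the full dataset this step is what reduces the 494020 original rows to 145586 unique rows.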
TABLE III. CATEGORY WISE DATA IN KDDCUP'99 DATA SET

S.No  Attack Category  No of Rows After Removing Duplicates
1     NORMAL           87832
2     U2R              52
3     R2L              999
4     PROBE            2131
5     DOS              54572

C. KDDCUP'99 Attack Types

The KDDCUP'99 dataset is classified into 23 attack types in the training dataset and 37 attack types in the test dataset. The additional 14 attack types in the test data have characteristics similar to the 23 training attack types, and our trained model should be able to identify them based on those 23 attack types. As seen in the KDDCUP'99 data structure and Tables I to IV, the distribution of data is skewed: the neptune, normal and smurf types contribute a large number of rows, while the least represented types (spy, phf, perl, multihop, loadmodule, ftp_write) have fewer than 10 records each in the entire dataset. For this reason, effective classification into all 23 attack types is difficult, as we will see in the experiments section; to train a Neural Network for effective classification we need an ample number of records for all attack types. In another experiment performed below, the 23 attack types are sub-classified into 5 main categories, as shown in Tables V to IX. The attack distributions in the KDDCUP'99 training and test datasets are different.

TABLE V. CATEGORY NORMAL

Normal (1 type)        No of Rows
1  normal    normal    87832
   Total               87832

TABLE VI. CATEGORY DOS

Denial of Service (6 types)  No of Rows
1  back      dos   968
2  land      dos   19
3  neptune   dos   51820
4  pod       dos   206
5  smurf     dos   641
6  teardrop  dos   918
   Total           54572

TABLE VII. CATEGORY PROBE

Probe (4 types)  No of Rows
1  ipsweep    probe  651
2  nmap       probe  158
3  portsweep  probe  416
4  satan      probe  906
   Total             2131

TABLE VIII. CATEGORY R2L

Remote to Local (8 types)  No of Rows
1  ftp_write     r2l  8
2  guess_passwd  r2l  53
3  imap          r2l  12
4  multihop      r2l  7
5  phf           r2l  4
6  spy           r2l  2
7  warezclient   r2l  893
8  warezmaster   r2l  20
   Total              999

TABLE IX. CATEGORY U2R

User to Root (4 types)  No of Rows
1  buffer_overflow  u2r  30
2  loadmodule       u2r  9
3  perl             u2r  3
4  rootkit          u2r  10
   Total                 52

D. ISSUES OF KDDCUP'99 DATA SET

As we can observe from Tables V to IX, the attack types are unevenly distributed across the dataset for three categories: User to Root (U2R), Remote to Local (R2L) and Probe. U2R has only 52 records (0.04%), R2L only 999 records (0.69%) and Probe only 2131 records (1.46%), whereas Denial of Service (DOS) has 54572 records (37.48%) and Normal has 87832 records (60.33%) of the total KDDCUP'99 training data set.

Given this skewed distribution, training a neural network for pattern classification to demonstrate misuse detection on KDDCUP'99 with 23 classes will not perform to the level expected, due to the small number of training records available for U2R, R2L and Probe. Even combined, U2R+R2L+Probe amount to only 2.19%, whereas DOS+Normal account for 97.81%. So it is not feasible to train a pattern classifier / neural network to demonstrate misuse detection at an acceptable level of performance on the KDD testing dataset for the three categories U2R, Probe and R2L. In the Model III experiments below (23 individual classes) we can observe this difference in detection accuracy between U2R+R2L+Probe on the one hand and DOS and Normal on the other; DOS attacks and Normal records perform better as they have a sufficient number of records for training the Neural Network, and the experiments below suggest some solutions to these problems. By contrast, Model I classifies smurf and normal into 2 classes (attack or normal), and Model II groups the 23 attack types into 5 main categories (Normal, U2R, R2L, Probe, DOS) as described in Tables V to IX; Model I and Model II achieve high accuracy, while Model III has some setbacks, which we examine in the experiments below.

Because of this skewed distribution, U2R+R2L+Probe form a minority group containing significantly fewer records than the majority group DOS+Normal, and as a result the KDDCUP'99 dataset suffers from a class imbalance problem.
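The class-imbalance percentages quoted in Section I.D follow directly from the per-category counts of Table III; this small stdlib Python sketch (the paper computes such figures in R) reproduces them.

```python
# Sketch of the class-imbalance figures of Section I.D, computed from
# the per-category record counts of Table III (after deduplication).
counts = {"normal": 87832, "dos": 54572, "probe": 2131, "r2l": 999, "u2r": 52}
total = sum(counts.values())                      # 145586 unique records

# Percentage share of each category, rounded to two decimals.
share = {k: round(100 * v / total, 2) for k, v in counts.items()}

# Minority vs majority group shares discussed in the text.
minority = round(100 * (counts["u2r"] + counts["r2l"] + counts["probe"]) / total, 2)
majority = round(100 * (counts["dos"] + counts["normal"]) / total, 2)
```

This is the arithmetic behind the 2.19% (U2R+R2L+Probe) versus 97.81% (DOS+Normal) split.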
Section II. Feature Selection and Data Processing on the KDDCUP'99 Dataset

The process flow applied to the KDDCUP'99 dataset is displayed in Figure 3. The processing and analysis were done with MATLAB 2019, the R language and the Weka tool on an Intel i7 processor with 16 GB RAM. The data set underwent the process flow described in Figure 3.

E. Data Set Pre-Processing

The KDD dataset downloaded from [1] is in raw format; on observation, the data file is comma-separated (csv). The header is in a separate file, names.txt, in row (vertical) format; we transposed it to fit the KDD 10% corrected testing data file as a header, and as the last column label is blank in names.txt we labelled the 42nd column as "label". On careful observation of the dataset we could assess, from Tables I to IX, that it contains a huge number of duplicate records, which need to be removed before training our Neural Network model, else overfitting will occur. Duplicate rows were removed using the unique function in R and also with the Excel file option.

The next step is to identify the nominal columns in the collected data. Columns 2, 3, 4 and 42 were identified as nominal (2-protocol_type, 3-service, 4-flag, 42-label, which is our target variable). To train any machine learning model we need all data in numeric form, so we converted the nominal columns to numeric data to change this raw data into a machine-understandable form.

In the KDD data set we found 42 features and 494020 records in total; after removing duplicates we have 145586 rows. In the target variable (label) we identified 23 attack types, which we classified into the 4 main attack categories (DOS, R2L, U2R, Probe) plus Normal.

FIGURE 3. KDDCUP'99 DATA SET PROCESS FLOW DIAGRAM

F. Data Set Filtering

After pre-processing the KDD dataset, we need to filter the data. Using the R programming language, we performed filtering and data analytics for KDDCUP'99 data visualization. First we labelled the columns of the dataset, then checked whether any NA values were present in the KDD data set; the output was 0, meaning no column contains NA or blank values. If there were blank cells, then, based on the skewness and variation of the data, we could use the mean, median or mode to fill the missing values; as the KDD data set has no missing data, we did not apply any of these techniques. The next step was to visualize the KDD label distribution: after executing barplot code in R, the distribution of data can be seen in Figure 4.

Filtering the data has several stages, one of which is finding zero variance within the columns (features). Applying a zero-variance filter identified columns (1 6 7 8 9 10 11 13 14 15 16 17 18 19 20 21 22 29 31 32 33 34 37), which can be removed from the dataset to reduce its dimensionality. After removing these columns, only 19 columns remain: (2 3 4 5 12 23 24 25 26 27 28 30 35 36 38 39 40 41 42).

For example, a near-zero-variance predictor is one that, for 1000 samples, has two distinct values, 999 of which are a single value [5].

FIGURE 4. KDDCUP'99 DATA SET ATTACKS DISTRIBUTION
Generated using R Studio.
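The near-zero-variance check described above is done in the paper with R's caret package [5]; the sketch below is a stdlib Python approximation of the same idea. The cutoff values (frequency ratio 19, i.e. 95/5, and 10% unique values) mirror caret's documented defaults and are assumptions here, as are the toy columns.

```python
# Illustrative near-zero-variance filter, mirroring the caret
# nearZeroVar step described above ([5]); cutoffs follow caret defaults.

def near_zero_variance(column, freq_ratio_cutoff=19.0, unique_cutoff=0.10):
    """Flag a column whose dominant value overwhelms the rest."""
    freq = {}
    for v in column:
        freq[v] = freq.get(v, 0) + 1
    counts = sorted(freq.values(), reverse=True)
    if len(counts) == 1:                 # zero variance: one distinct value
        return True
    freq_ratio = counts[0] / counts[1]   # most common vs second most common
    pct_unique = len(counts) / len(column)
    return freq_ratio > freq_ratio_cutoff and pct_unique < unique_cutoff

constant_col = [0] * 1000                # zero variance
skewed_col = [0] * 999 + [1]             # the 999-of-1000 example from [5]
varied_col = list(range(1000))           # plenty of variance
```

The 999-of-1000 column from the example in [5] is flagged, while a column with many distinct values is kept.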

G. KDD Data Normalization

In normalization, all record values are scaled to between 0 and 1. Normalization is done here because the different features in the dataset do not have the same range of values; as a result, gradient descent may take longer and oscillate back and forth before finding a global or local minimum. To overcome this learning problem in the NN model, it is important to normalize the data so that the different features take the same range of values and gradient descent can converge quickly. Model accuracy improves with normalized data.
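The rescaling step above can be sketched as follows. The paper performs it in MATLAB (mapminmax); this stdlib Python version is only an illustrative equivalent that maps each feature column to [0, 1], with toy data.

```python
# Minimal min-max normalization sketch: every feature column is rescaled
# to [0, 1] so that gradient descent converges faster, as described above.

def min_max_normalize(rows):
    """Rescale each column of a list-of-lists dataset to the [0, 1] range."""
    cols = list(zip(*rows))                          # transpose to columns
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        if hi == lo:                                 # constant column: map to 0
            scaled_cols.append([0.0] * len(col))
        else:
            scaled_cols.append([(v - lo) / (hi - lo) for v in col])
    return [list(r) for r in zip(*scaled_cols)]      # transpose back to rows

data = [[0, 100, 5], [5, 200, 5], [10, 400, 5]]      # toy feature rows
norm = min_max_normalize(data)
```

After this step, features such as byte counts and connection counts share the same [0, 1] scale regardless of their original magnitudes.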

Once the data is cleansed — by removing duplicate rows, removing zero-variance columns, checking for blank cells and NA columns, replacing nominal columns with numeric data, and then normalizing — we can load it into MATLAB for processing, transform it there, and simultaneously load the target class data.

By normalizing the data we can achieve high performance with high dimensions when many classes must be distinguished, as in KDD with its 23 classes; it is important to keep the features that build a better model, as low dimensionality suffers in prediction accuracy, whereas high dimensionality with important features gives better prediction accuracy.

In the experiments below we have classified the data into three models. In Model I we classify the data into 2 classes (normal/anomaly); in Model II the 23 classes are grouped into 5 main categories (Normal, U2R, R2L, Probe, DOS); and in Model III we classify the data into 23 classes based on the labels in column 42.

The full Kdd.data test dataset has 4898430 records with 42 features, and the training data has 494020 records with 42 features.

H. Naïve Bayes

We built a Naïve Bayes model, after applying the near-zero-variance filter, in R using the e1071 package, on 19 features (2 3 4 5 12 23 24 25 26 27 28 30 35 36 38 39 40 41 42).

The accuracy of the Naïve Bayes output is 64% on the above selected feature set.

I. Random Forest

The random forest algorithm was performed on the 25 features listed below:

(srv_rerror_rate, rerror_rate, flag, dst_host_rerror_rate, logged_in, dst_bytes, src_bytes, num_compromised, dst_host_srv_count, duration, dst_host_same_src_port_rate, dst_host_diff_srv_rate, dst_host_count, dst_host_srv_serror_rate, count, hot, dst_host_same_srv_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, serror_rate, srv_serror_rate, diff_srv_rate, srv_count, srv_diff_host_rate, protocol_type, result)

TABLE X. RANDOM FOREST PREDICTION

Pred     dos     normal  probe  r2l  u2r
dos      352278  16      8      2    0
normal   23      87507   31     85   43
probe    11      14      3657   2    0
r2l      0       13      0      923  0
u2r      0       0       0      1    3

TABLE XI. RANDOM FOREST RESULTS (%)

         dos    normal  probe  r2l    u2r
dos      99.99  0       0      0      0
normal   0.03   99.79   0.04   0.1    0.05
probe    0.3    0.38    99.27  0.05   0
r2l      0      1.39    0      98.61  0
u2r      0      0       0      25     75

Result: the efficiency of Random Forest is almost 99%.

J. Feature Selection, Data Cleaning and Dimensionality Selection

After applying various filters — duplicate record removal, unique record identification, zero-variance columns (e.g. columns 20 and 21), dataset cleaning, feature selection with Random Forest, Principal Component Analysis, Weka attribute selection, and R library feature selection by variance — we obtained the feature sets listed in Table XII, which we will use and test in Model I, Model II and Model III of the Neural Network.

TABLE XII. FEATURE SELECTION AFTER DIMENSIONALITY REDUCTION

Code     Column Numbers                                                        Dimensions
PC12     2,13,22,24,25,28,30,31,32,33,37,38                                    12
ALLD     2,3,5,6,23,24,39                                                      7
HDR1     3,5,6,39                                                              4
HDR2     3,4,5,6,14,16,27,28,37,39                                             10
IGR      2,3,5,6,23,24,33,34,35,36                                             10
ME       2,3,4,5,6,12,14,16,21,23,24,27,28,29,30,31,32,33,34,35,36,37,39       23
PCA      4,21,23,24,27,31,32,36,37,40                                          10
WEKA     2,3,4,5,6,7,8,14,23,30,36                                             11
WEKA_RF  27,28,4,24,21,23,37,32,13,5,3,31,8,36,6                               15
R_RF     24,25,4,38,12,6,5,14,31,1,34,33,30,37,21,10,32,35,36,23,28,22,29,2,41  25

Table XII displays the various feature sets, from low to high dimensionality, extracted using PCA, Weka, feature-extraction algorithms, etc.: (HDR1, HDR2, IGR, WEKA, WEKA_RF) from Weka; (PC12, PCA, R_RF) from R; (ALLD, ME) from manual selection by data analysis using R.

Section III. Neural Network Topology for the KDDCUP'99 Dataset

In building Models I to III, MATLAB was used to build the Neural Network, perform data transformations and normalization, and plot the confusion matrix, ROC, etc. The WEKA tool was used for feature selection and for the Naïve Bayes and random forest algorithms. The R programming language was used for principal component analysis, data analytics, visualization, checking which features influence the 5 main categories, variance filtration, etc.; MS Excel was used for basic data manipulations.
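The percentage results of Table XI follow from the raw counts of Table X by normalizing each row by its total; this small stdlib Python sketch (the paper derives these figures in R/Weka) reproduces that step, with the counts copied from Table X.

```python
# Sketch linking Table X to Table XI: each row of the raw Random Forest
# confusion matrix (counts from Table X) is divided by its row total to
# give per-class percentages.
labels = ["dos", "normal", "probe", "r2l", "u2r"]
confusion = [
    [352278, 16, 8, 2, 0],       # predicted dos
    [23, 87507, 31, 85, 43],     # predicted normal
    [11, 14, 3657, 2, 0],        # predicted probe
    [0, 13, 0, 923, 0],          # predicted r2l
    [0, 0, 0, 1, 3],             # predicted u2r
]

percent = [[round(100 * v / sum(row), 2) for v in row] for row in confusion]
```

The tiny u2r row (4 predictions in total) is why its percentages swing to 25/75, illustrating how the minority classes distort rate-based metrics.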
A. Neural Network Model I (Classification as either attack or normal)

a) In Model I we segregated the dataset by taking 50% smurf attack records and 50% normal records and clubbing them together for classification and testing. Here 641 records of the normal category and 641 records of the smurf category were combined to create a sample dataset of 1282 records for testing. Model I dataset details are given below.

TABLE XIII. MODEL I DATASET DETAILS

Data / Records      smurf    normal
Total records       280790   97278
Unique records      641      87832
Duplicate records   280149   9446
Model I             641      641

b) After applying the pre-processing techniques discussed in the sections above, the data is automatically loaded by a single script into MATLAB, which loads and transforms the dataset and divides it into training, validation and testing sets. This script has options to modify the network training functions or error functions, and it displays the Receiver Operating Characteristics (ROC), confusion matrix, error histogram, training state (gradient and validation checks) and performance (best validation performance after a certain number of epochs).

c) Neural Net Topology

To train and build the neural network, the target and sample datasets were supplied to a neural net with a hidden layer size of 10, and the scaled conjugate gradient training function was used. The input/output pre/post-processing functions removeconstantrows and mapminmax were used. For division of the data, the dividerand function was used to divide the data randomly, with the training ratio set to 70%, the validation set to 15% and the test set to 15%. The cross-entropy function was used as the performance metric, and the plotting functions plotperform, plottrainstate, ploterrhist, plotconfusion and plotroc were used to plot the results/output of the Neural Network.

d) After running Model I we achieved a 100% detection rate classifying normal versus attack, as shown in the Figure 5 confusion matrix: the two output classes are correctly classified against the two target classes.

e) In Figure 5 the target and output classes are correctly classified with 100% true positives. In the Figure 6 ROC plot we can observe class 1 and class 2 plotted on the Y axis from 0 to 1, vertically straight, without any divergence or false positives.

FIGURE 5. CONFUSION MATRIX OF MODEL I

FIGURE 6. MODEL I RECEIVER OPERATING CHARACTERISTICS (ROC)

f) Model I Performance

Percentage Correct Classification: 100.000%
Percentage Incorrect Classification: 0.000%

B. Neural Network Model II (Classification of 23 attack types grouped into 5 main categories)

a) In Model II the 23 KDDCUP'99 attack types were categorised into 5 main groups based on the segregation in Tables V to IX. The 5 main categories are Normal, R2L, U2R, Probe and DOS. Normal has only 1 sub-type, R2L has 8 sub-types, U2R has 4 sub-types, Probe has 4 sub-types and DOS has 6 sub-types, as listed in Tables V to IX.

b) The pre-processed data set, after the data cleaning methods described in Section II, gave us a list of 10 coded feature selection options: some with high dimensionality reduction (fewer features), others extracted by principal component analysis, variance selection or the Weka attribute selection tool. We use this pre-processed data, normalised and converted from nominal to numeric, to train and test the Neural Network, and we test all the feature sets listed in Table XII.

c) With MATLAB code, both sample and target data are automatically loaded, transformed and processed to train neural networks with different settings.
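The random 70/15/15 division performed by MATLAB's dividerand (used for Model I above) can be sketched as follows; this stdlib Python version is only an illustrative equivalent, applied to the 1282-record Model I sample.

```python
# Sketch of the dividerand-style 70/15/15 split used for Model I.
# MATLAB's dividerand assigns samples to train/validation/test at random;
# this is an illustrative stdlib equivalent over sample indices.
import random

def divide_rand(n_samples, train=0.70, val=0.15, seed=0):
    """Randomly partition sample indices into train/validation/test sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)         # fixed seed for reproducibility
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# Model I sample dataset: 641 smurf + 641 normal = 1282 records.
tr, va, te = divide_rand(1282)
```

Each index lands in exactly one partition, so training, validation and testing sets never overlap.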

d) Neural Network Topology

In Model II the output and target are divided into 5 classes, and the dataset was supplied to the neural network with options: a) 1 hidden layer with 10 neurons; b) later modified to 1 hidden layer with 30 neurons; and c) 2 hidden layers, the 1st with 30 neurons and the 2nd with 5 neurons. Options b) and c) gave optimal results. The training function used was scaled conjugate gradient backpropagation (SCG), which updates weights and bias values according to the scaled conjugate gradient method. The cross-entropy function was used as the performance function, and the data was divided into 70% training, 15% validation and 15% testing. All parameters can be modified to change the functions or settings of the neural network.

Bayesian regularization backpropagation was also tested as a training function. It updates the weights and bias values according to Levenberg-Marquardt optimization, minimizing a combination of squared errors and weights and then determining the correct combination so as to produce a network that generalizes well; this process is called Bayesian regularization [7]. This function was run on an Intel i7 with 2 cores / 4 logical processors and 16 GB RAM.

The trainbr function takes longer, but it is better for challenging problems and its results were optimal.

FIGURE 6. CONFUSION MATRIX OF MODEL II, SCALED CONJUGATE GRADIENT BACKPROPAGATION (SCG): 100%

FIGURE 8. ROC PLOT OF MODEL II, BAYESIAN REGULARIZATION BACKPROPAGATION (BRB)

TABLE XIV. MODEL II DATASET DETAILS

Attack Type  No of Records  Percent
normal       87832          60.330%
u2r          52             0.036%
r2l          999            0.686%
probe        2131           1.464%
dos          54572          37.484%

e) In Figure 6 above, the scaled conjugate gradient backpropagation function was used to train the neural network and achieved optimal performance in less time with labelled (supervised) datasets. In Figures 7 and 8, the Bayesian regularization backpropagation function was used to train the neural network without the labelled dataset, which took longer; the categories Normal, R2L, Probe and DOS all achieved optimal performance (97% to 99%), but U2R, which suffers from having fewer records, was misclassified and achieved only 40.8%. Even though the neural network was supplied with an unlabelled dataset, it performed well in terms of classification.

f) In Figure 6 we can observe that the 23 attack types, grouped into the 5 main categories, were classified with an accuracy of 100%.

g) Model II Performance

Percentage Correct Classification: 100.000%
Percentage Incorrect Classification: 0.000%
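The Model II grouping of the 23 training labels into the 5 main categories of Tables V to IX can be expressed as a simple lookup; this stdlib Python sketch is an illustrative equivalent of the relabelling the paper performs in MATLAB/R.

```python
# Sketch of the Model II label grouping: the 23 KDDCUP'99 training labels
# are mapped to the 5 main categories of Tables V-IX.
CATEGORY = {
    "normal": "normal",
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    "ipsweep": "probe", "nmap": "probe", "portsweep": "probe", "satan": "probe",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l", "multihop": "r2l",
    "phf": "r2l", "spy": "r2l", "warezclient": "r2l", "warezmaster": "r2l",
    "buffer_overflow": "u2r", "loadmodule": "u2r", "perl": "u2r", "rootkit": "u2r",
}

def to_category(label):
    """Map a raw KDD label (e.g. 'smurf.') to its Model II category."""
    return CATEGORY[label.rstrip(".")]   # raw labels carry a trailing dot
```

The mapping covers exactly 23 labels: 1 normal, 6 dos, 4 probe, 8 r2l and 4 u2r, matching Tables V to IX.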

FIGURE 7. CONFUSION MATRIX OF MODEL II, BAYESIAN REGULARIZATION BACKPROPAGATION (BRB)

C. Neural Network Model III (Classification of 23 attack types)

a) In Model III we tested two data sets: the original and a synthetic (duplicated) dataset.

1. Model III experiment 1: the original dataset without duplicate records, which suffers from the class imbalance problem. We address this problem and apply one of its solutions in Model III experiment 2.

2. Model III experiment 2: the original dataset with an equal number of instances in all 23 categories, each category with 100 records. The number of instances in the 23 categories originally ranged from 2 to over 6000, and a few categories had only 2 records; we created duplicate records, similar to synthetic data generation, replicating the originals up to 100 records, so that all 23 categories have 100 records each. Here we applied one of the solutions to the class imbalance problem to test our model's performance.

b) Neural Network Topology

In this model there are 2 hidden layers, the first with 30 neurons and the second with 23 neurons. The training function used is scaled conjugate gradient backpropagation with cross-entropy as the performance function, and the confusion matrix, ROC, error histogram and validation plots were plotted.

FIGURE 9. CONFUSION MATRIX, MODEL III EXPERIMENT 1

FIGURE 11. ROC, MODEL III EXPERIMENT 2
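The balancing step of Model III experiment 2 (replicating each class's records until every class has exactly 100) can be sketched as follows. This stdlib Python version is only an illustrative equivalent; the toy class sizes mirror the 2-to-6000+ range described above, and truncating large classes to 100 is an assumption implied by "all 23 categories have 100 records each".

```python
# Sketch of the Model III experiment 2 balancing step: every class is
# brought to exactly `target` records by cycling through its existing rows
# (oversampling small classes, truncating large ones).
import itertools

def balance_to(records_by_class, target=100):
    """Duplicate each class's rows cyclically until it has `target` records."""
    balanced = {}
    for label, rows in records_by_class.items():
        balanced[label] = list(itertools.islice(itertools.cycle(rows), target))
    return balanced

toy = {"spy": [("row_a",), ("row_b",)],               # only 2 records originally
       "neptune": [("row_%d" % i,) for i in range(6000)]}
result = balance_to(toy)
```

A 2-record class like spy ends up as 50 copies of each of its two rows, which is exactly the kind of synthetic duplication the experiment relies on.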

Section IV. Observations and Statistical Analysis of the KDDCUP'99 Dataset

In Figure 9, as a few of the classes fall under the minority group and have only 2 to 10 rows per class, only those categories were misclassified, and the model could not achieve optimal performance.

As the KDDCUP'99 dataset is skewed, there is a data distribution problem: U2R+R2L+Probe form a minority group containing significantly fewer records than DOS+Normal, the majority group; because of this differing distribution, the KDDCUP'99 dataset suffers from a class imbalance problem.

Justin M. Johnson and Taghi M. Khoshgoftaar [6] have surveyed deep learning with class imbalance. In their 2019 work on machine learning techniques for class-imbalanced data, they note that addressing class imbalance with traditional machine learning techniques has been studied extensively over the last two decades [6]. The bias towards the majority class can be alleviated by altering the training data to decrease imbalance, or by modifying the model's underlying learning or decision process to increase sensitivity towards the minority group [6]. As such, methods for handling class imbalance are grouped into data-level techniques, algorithm-level methods, and hybrid approaches [6].

In the principal component analysis of Figure 12, the red lines represent columns and the black clusters represent the rows. Red arrows pointing in the same direction and close to each other have more correlation among them; the plotted graph gives a clear visualization of PC1 and PC2.

FIGURE 12. PRINCIPAL COMPONENT ANALYSIS

FIGURE 10. CONFUSION MATRIX, MODEL III EXPERIMENT 2

In Figure 10, for Model III experiment 2, all 23 classes were correctly classified and the model performed well with 99.43% accuracy.

c) Model III Performance

Percentage Correct Classification: 99.434783%
Percentage Incorrect Classification: 0.565217%
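The biplot reading above ("arrows pointing the same way are more correlated") can be sanity-checked numerically. The paper does its PCA and correlation work in R; this stdlib Python sketch computes the Pearson correlation between toy feature columns (the column names and values are illustrative assumptions, not KDD data).

```python
# Numeric companion to the Figure 12 biplot interpretation: features whose
# arrows point the same way have Pearson correlation near +1, opposite
# directions near -1.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

src_bytes = [100, 200, 300, 400]       # toy feature columns, not KDD values
dst_bytes = [110, 190, 310, 390]       # moves with src_bytes -> r near +1
count = [9, 7, 4, 2]                   # moves against them -> r near -1
```

Highly correlated pairs like these are exactly the redundancy that PCA exploits when reducing dimensionality.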
Section V. Conclusion and future work.

In this paper three models were built with different parameters on neural networks, and their performance metrics were captured. Experiments based on the KDDCUP'99 dataset gave encouraging results.

All three models were tested with different training functions: trainscg (scaled conjugate gradient backpropagation), trainbr (Bayesian regularization backpropagation), trainlm (Levenberg-Marquardt backpropagation), and newrb (radial basis function network), and with different epoch settings. The software is designed so that all parameters can be changed, such as the number of layers, the number of neurons, and the error functions. Model I gives 100% accuracy, as its datasets are balanced and the task is binary classification. For Model II, the first experiment (Figures 7 and 8), run as unsupervised learning using Bayesian regularization backpropagation, showed optimal results, and the second experiment, using scaled conjugate gradient backpropagation, gave 100% accuracy. For Model III, Experiment 1 tested the original dataset without duplicate records; owing to the class imbalance problem the minority classes suffered misclassification while the majority class was classified with 99% accuracy. In Experiment 2 synthetic data was added manually to balance the datasets, which gave optimal results with 100% accuracy in all classes.

There is considerable scope for further work: evaluating the dataset with the aid of Hadoop and Scala for better performance when dealing with huge datasets, applying deep learning methods to the KDDCUP'99 dataset, and applying class imbalance techniques to improve the classification accuracy of the minority classes.

Another dataset, NSL-KDD, is also available; as stated on its website, its main purpose is to provide a stable, equally distributed dataset for research.

This work can be further extended: since the KDD dataset suffers from the class imbalance problem, synthetic data can be inserted.

Based on the above observations we propose the feature selection for the KDD dataset given in Table XII. Reducing too many dimensions lowers the accuracy of attack detection, whereas an optimal feature subset detects all the attack types.

The KDD dataset also tends to produce misleadingly high accuracy results of around 99%, and this is a major problem with the dataset: a model can predict the majority class for every instance and still achieve high classification accuracy on a dataset with a large class imbalance, which is known as the accuracy paradox [10]. Because the majority class drives the high accuracy rate, precision and recall can help in such cases, although precision can also be biased by very unbalanced classes.

REFERENCES
[1] The KDD Cup 1999 dataset. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[2] E. Frank, M. A. Hall, and I. H. Witten, The WEKA Workbench. Online appendix for "Data Mining: Practical Machine Learning Tools and Techniques," 4th ed. Morgan Kaufmann, 2016.
[3] MATLAB and Statistics Toolbox Release 2014a, The MathWorks, Inc., Natick, Massachusetts, United States.
[4] R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. [Online]. Available: http://www.R-project.org
[5] nearZeroVar, caret package. [Online]. Available: https://www.rdocumentation.org/packages/caret/versions/6.0-82/topics/nearZeroVar
[6] J. M. Johnson and T. M. Khoshgoftaar, "Survey on deep learning with class imbalance," Journal of Big Data, vol. 6, no. 27, 2019. [Online]. Available: https://doi.org/10.1186/s40537-019-0192-5
[7] MATLAB software documentation. [Online].
[8] R language software documentation. [Online].
[9] Weka software tool documentation. [Online].
[10] Accuracy paradox. [Online]. Available: https://en.wikipedia.org/wiki/Accuracy_paradox

Appendix I.

The training dataset (kddcup_10%_labeled) has 23 attack types and the test dataset (corrected) has 38 attack types, with 21 attack types duplicated in both datasets and 40 unique attack types overall across training and testing. Two attack types, spy and warezclient, appear only in the training dataset, and 17 attack types appear only in the testing dataset. In total there are 21 duplicated attack types, 2 unique to training, and 17 unique to testing, which comes to 21 + 17 + 2 = 40 attack types.

TABLE XV. KDD TRAIN AND TEST DATASET ATTACK TYPES

S.No  Train Dataset     Test Dataset       Unique Records (Train+Test)
1     normal            normal.            normal
2     back              snmpgetattack.     back
3     buffer_overflow   named.             buffer_overflow
4     ftp_write         xlock.             ftp_write
5     guess_passwd      smurf.             guess_passwd
6     imap              ipsweep.           imap
7     ipsweep           multihop.          ipsweep
8     land              xsnoop.            land
9     loadmodule        sendmail.          loadmodule
10    multihop          guess_passwd.      multihop
11    neptune           saint.             neptune
12    nmap              buffer_overflow.   nmap
13    perl              portsweep.         perl
14    phf               pod.               phf
15    pod               apache2.           pod
16    portsweep         phf.               portsweep
17    rootkit           udpstorm.          rootkit
18    satan             warezmaster.       satan
19    smurf             perl.              smurf
20    spy               satan.             spy
21    teardrop          xterm.             teardrop
22    warezclient       mscan.             warezclient
23    warezmaster       processtable.      warezmaster
24                      ps.                snmpgetattack
25                      nmap.              named
26                      rootkit.           xlock
27                      neptune.           xsnoop
28                      loadmodule.        sendmail
29                      imap.              saint
30                      back.              apache2
31                      httptunnel.        udpstorm
32                      worm.              xterm
33                      mailbomb.          mscan
34                      ftp_write.         processtable
35                      teardrop.          ps
36                      land.              httptunnel
37                      sqlattack.         worm
38                      snmpguess.         mailbomb
39                                         sqlattack
40                                         snmpguess

Appendix II.

Data Analytics on the KDD dataset using R

Exploratory Analysis

Observation 1: From the exploratory plot, dst_host_same_src_port_rate has a slight effect on the intrusion type; for dst_host_same_src_port_rate values greater than or equal to 1, the attack type can be probe or r2l.

Observation 2: flag is a strong predictor; for flag = REJ or S0 the attack type is dos.

Observation 3: For duration greater than 30000 the attack type is probe, so duration is a strong predictor.

Appendix III. Evaluators and algorithms run in WEKA

1. kMeans

Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: X_Sample
Instances: 1282
Attributes: 42
Time taken to build model (full training data): 0.08 seconds
=== Model and evaluation on training set ===
Clustered Instances
0    641 (50%)
1    641 (50%)

2. attributeSelection.CorrelationAttributeEval

Evaluator: weka.attributeSelection.CorrelationAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: X_Sample
Instances: 1282
Attributes: 42
Search Method: Attribute ranking.
Attribute Evaluator (supervised, Class (numeric): 42 label): Correlation Ranking Filter
Ranked attributes:
0.991    2 protocol_type
0.9876  12 logged_in
0.885    3 service
0.576   37 dst_host_srv_diff_host_rate
0.3157  31 srv_diff_host_rate
0.3152   6 dst_bytes
0.2767  34 dst_host_same_srv_rate
0.1951  39 dst_host_srv_serror_rate
0.1789  33 dst_host_srv_count
0.0477  30 diff_srv_rate
0.0425  26 srv_serror_rate
0.0397  25 serror_rate
0.0395  10 hot
0.0394   1 duration

3. attributeSelection.ClassifierAttributeEval

Evaluator: weka.attributeSelection.ClassifierAttributeEval -execution-slots 1 -B weka.classifiers.rules.ZeroR -F 5 -T 0.01 -R 1 -E DEFAULT
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: X_Sample
Instances: 1282
Attributes: 42
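WEKA's CorrelationAttributeEval, used in item 2 above, ranks each attribute by the magnitude of its Pearson correlation with the class. A minimal sketch of the same ranking idea in plain Python (the toy columns and values below are illustrative, not the actual KDD attributes or scores):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(columns, label):
    """Rank attributes by |correlation with the class label|, descending."""
    scores = {name: abs(pearson(col, label)) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: 'logged_in' tracks the label closely, 'duration' barely.
columns = {
    "logged_in": [0, 0, 1, 1, 1, 0, 1, 1],
    "duration":  [5, 3, 4, 6, 2, 5, 3, 4],
}
label = [0, 0, 1, 1, 1, 0, 1, 0]
for name, score in rank_attributes(columns, label):
    print(f"{score:.4f}  {name}")
```

As in the WEKA output, strongly correlated attributes float to the top of the ranking, which is what makes this a cheap first-pass feature selection filter.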
