Intrusion Detection Using Artificial Intelligence On KDD Data Set
Raghvendra Giri
Computer Engineering and Media
De Montfort University
Leicester, United Kingdom
Abstract—Developing an efficient Intrusion Detection System (IDS) has attracted many researchers working on the detection of novel network attacks. As data volumes grow exponentially and stringent new global policies on protecting and sharing data come into force, it is vital to protect organizations and small-scale businesses such as hospitals, power grids, and healthcare grids. Here we use the KDDCUP’99 data set to evaluate intrusion detection, examine the problems of the KDDCUP’99 data set, and propose the development of a novel, efficient IDS with the aid of Artificial Intelligence.

Keywords—Neural Network, KDDCUP’99 data set, Intrusion Detection (IDS), Artificial Intelligence.

Software Tools Used – 1) MATLAB: programming code for normalization, processing, loading, and transformation of data, and for designing the neural network topology. 2) R: programming language code for statistical analysis, observation of data trends, Principal Component Analysis, Random Forest, data analytics, and visualization. 3) WEKA: to classify and select attributes by importance based on average impurity decrease, Random Forest, and PCA.

[...] Program, which lasted for nine weeks; raw TCP dump data was collected by simulating a US Air Force LAN as if it were a real US Air Force environment.

Section I describes the KDDCUP’99 data set: its description, structure, attack categories, and known issues. Section II describes feature selection techniques and PCA, applies Naïve Bayes and Random Forest, and discusses two trade-offs: reducing dimensionality to improve performance versus keeping high dimensionality to improve prediction accuracy, and correlation with a high detection rate versus normalization of data for better performance. Section III describes the neural network topology, Model I (normal/abnormal), Model II (classification into 5 main categories), Model III (classification into 23 categories), and experiments with different dimension selections, NN training functions, and error functions. Section IV describes the observations and results on the KDDCUP’99 data set from statistical analysis and neural networks. Section V presents the conclusions, the proposed KDDCUP’99 features, and future work.

Section I. KDDCUP’99 Dataset
ATTACK CATEGORIES
Serial Number  Abbreviation  Category Type                                                               Example
1              DOS           Denial-of-Service                                                           Syn flood
2              R2L           Remote to Local (unauthorized access from a remote machine)                 Guessing password
3              U2R           User to Root (unauthorized access to local superuser (root) privileges)     Buffer overflow
4              Probing       Probing (scanning/determining weakness or vulnerability of remote machine)  Port scanning
a. KDDCUP’99 Data Set [1].

FIGURE 1. CATEGORY DISTRIBUTION OF KDDCUP’99 DATA SET

FIGURE 2. CATEGORY DISTRIBUTION OF KDDCUP’99 DATA SET
[Pie chart "Distribution of KDD Dataset": normal, 87832, 60.33%; dos, 54572, 37.48%; probe, 2131, 1.46%; r2l, 999, 0.69%; u2r, 52, 0.04%]

TABLE II. DUPLICATE RECORDS IN KDDCUP’99 DATA SET
Number of Records                            Count
Total no of Original Rows                    494020
Total no of Rows After Removing Duplicates   145586
No of Duplicate Rows Removed                 348434

S.No   Label            Category  No of Total Records  No of Records after Removing Duplicate Rows  No of Duplicates
1      back.            dos       2203                 968                                          1235
2      Buffer overflow  u2r       30                   30                                           0
3      ftp_write.       r2l       8                    8                                            0
4      Guess passwd.    r2l       53                   53                                           0
5      imap.            r2l       12                   12                                           0
6      ipsweep.         probe     1247                 651                                          596
7      land.            dos       21                   19                                           2
8      loadmodule.      u2r       9                    9                                            0
9      multihop.        r2l       7                    7                                            0
10     neptune.         dos       107201               51820                                        55381
11     nmap.            probe     231                  158                                          73
12     normal.          normal    97277                87832                                        9445
13     perl.            u2r       3                    3                                            0
14     phf.             r2l       4                    4                                            0
15     pod.             dos       264                  206                                          58
16     portsweep.       probe     1040                 416                                          624
17     rootkit.         u2r       10                   10                                           0
18     satan.           probe     1589                 906                                          683
19     smurf.           dos       280790               641                                          280149
20     spy.             r2l       2                    2                                            0
21     teardrop.        dos       979                  918                                          61
22     warezclient.     r2l       1020                 893                                          127
23     warezmaster.     r2l       20                   20                                           0
Total                             494020               145586                                       348434
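As a cross-check on the duplicate-removal figures above, the following minimal Python sketch aggregates the per-attack record counts (after removing duplicates) into the five main categories and derives each category's share. It is an illustration only, not the MATLAB or R code used in this work: the helper names (`drop_duplicates`, `category_totals`, `percentages`) are hypothetical, the label-to-category mapping is the usual KDDCUP’99 grouping, and the toy rows passed to `drop_duplicates` are invented.

```python
from collections import Counter

# Attack label -> main category (usual KDDCUP'99 grouping; labels written
# without the trailing period used in the data file).
CATEGORY = {
    "back": "dos", "land": "dos", "neptune": "dos", "pod": "dos",
    "smurf": "dos", "teardrop": "dos",
    "ipsweep": "probe", "nmap": "probe", "portsweep": "probe", "satan": "probe",
    "buffer_overflow": "u2r", "loadmodule": "u2r", "perl": "u2r", "rootkit": "u2r",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l", "multihop": "r2l",
    "phf": "r2l", "spy": "r2l", "warezclient": "r2l", "warezmaster": "r2l",
    "normal": "normal",
}

# Record counts after duplicate removal, copied from the table above.
UNIQUE_COUNTS = {
    "back": 968, "buffer_overflow": 30, "ftp_write": 8, "guess_passwd": 53,
    "imap": 12, "ipsweep": 651, "land": 19, "loadmodule": 9, "multihop": 7,
    "neptune": 51820, "nmap": 158, "normal": 87832, "perl": 3, "phf": 4,
    "pod": 206, "portsweep": 416, "rootkit": 10, "satan": 906, "smurf": 641,
    "spy": 2, "teardrop": 918, "warezclient": 893, "warezmaster": 20,
}


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def category_totals(counts):
    """Sum per-attack counts into the five main categories."""
    totals = Counter()
    for label, n in counts.items():
        totals[CATEGORY[label]] += n
    return totals


def percentages(totals):
    """Express each category total as a percentage of all records."""
    grand_total = sum(totals.values())
    return {name: 100.0 * n / grand_total for name, n in totals.items()}


# Toy rows (invented) showing the duplicate-removal step itself.
toy_rows = [("tcp", "http", "SF"), ("tcp", "http", "SF"), ("icmp", "ecr_i", "SF")]
unique_rows = drop_duplicates(toy_rows)

totals = category_totals(UNIQUE_COUNTS)
shares = percentages(totals)
for name in sorted(totals, key=totals.get, reverse=True):
    print(f"{name:<7}{totals[name]:>8}{shares[name]:>9.3f}%")
```

Summing the deduplicated counts this way gives 145586 records in total, which matches the number of rows remaining after duplicate removal.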
TABLE III. CATEGORY WISE DATA IN KDDCUP’99 DATA SET
Attack Type  No of Records  Percent
normal       87832          60.330%
u2r          52             0.036%
r2l          999            0.686%
probe        2131           1.464%
dos          54572          37.484%

f) In Figure 6, we can observe that all 23 attack types, grouped into the 5 main categories, were classified with an accuracy of 100%.

g) Model II Performance
Percentage Correct Classification: 100.000%
Percentage Incorrect Classification: 0.000%
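The percentage of correct classification reported for Model II is the share of samples on the diagonal of the confusion matrix. The Python sketch below shows that calculation for the five main categories. It is a minimal illustration, not the MATLAB code used in this work: the function names are hypothetical, the labels are invented toy values rather than KDDCUP’99 results, and rows are taken to be actual classes and columns predicted classes.

```python
def confusion_matrix(actual, predicted, classes):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def percent_correct(matrix):
    """Diagonal entries are correct predictions; divide by all samples."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


CLASSES = ["normal", "dos", "probe", "r2l", "u2r"]

# Invented toy labels for illustration only (one dos sample is misclassified).
actual = ["normal", "normal", "dos", "dos", "probe", "r2l", "u2r", "dos"]
predicted = ["normal", "normal", "dos", "dos", "probe", "r2l", "u2r", "probe"]

matrix = confusion_matrix(actual, predicted, CLASSES)
correct_pct = percent_correct(matrix)
incorrect_pct = 100.0 - correct_pct

print(f"Percentage Correct Classification:   {correct_pct:.3f}%")
print(f"Percentage Incorrect Classification: {incorrect_pct:.3f}%")
```

A result of 100.000% correct, as reported for Model II, corresponds to a confusion matrix whose off-diagonal entries are all zero.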
FIGURE 7. CONFUSION MATRIX OF MODEL II BAYESIAN REGULARIZATION BACKPROPAGATION (BRB)

C. Neural Network Model III (Classification of 23 attack types)
Exploratory Analysis

Observation 1: In the above image we can see that dst_host_same_src_port_rate has a slight effect on the intrusion type: when its value is greater than or equal to 1, the attack type can be probe or r2l.

Observation 2: Here we can see that flag is a strong predictor: for flag = REJ and S0, the attack type is dos.

2. attributeSelection.CorrelationAttributeEval
Evaluator: weka.attributeSelection.CorrelationAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: X_Sample
Instances: 1282
Attributes: 42
Search Method:
    Attribute ranking.
Attribute Evaluator (supervised, Class (numeric): 42 label):
    Correlation Ranking Filter
Ranked attributes:
 0.991    2 protocol_type
 0.9876  12 logged_in
 0.885    3 service
 0.576   37 dst_host_srv_diff_host_rate
 0.3157  31 srv_diff_host_rate
 0.3152   6 dst_bytes
 0.2767  34 dst_host_same_srv_rate
 0.1951  39 dst_host_srv_serror_rate
 0.1789  33 dst_host_srv_count
 0.0477  30 diff_srv_rate
 0.0425  26 srv_serror_rate
 0.0397  25 serror_rate
 0.0395  10 hot
 0.0394   1 duration
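The ranking above orders attributes by their correlation with the class label. The idea can be sketched in Python for numeric columns, assuming Pearson's correlation as the score. This is an approximation for illustration, not WEKA's implementation (WEKA also handles nominal attributes, which this sketch does not). The function names and the toy columns are hypothetical and are not drawn from the X_Sample relation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, target):
    """Return (score, name) pairs sorted by absolute correlation, highest first."""
    scores = [(abs(pearson(values, target)), name) for name, values in columns.items()]
    return sorted(scores, reverse=True)


# Invented toy data: three numeric attributes and a numeric class label.
columns = {
    "logged_in": [0, 0, 1, 1, 1, 0],
    "duration": [5, 1, 3, 2, 4, 6],
    "hot": [0, 0, 0, 0, 0, 0],
}
target = [0, 0, 1, 1, 1, 0]

ranking = rank_attributes(columns, target)
for score, name in ranking:
    print(f"{score:.4f}  {name}")
```

Taking the absolute value means strongly negative correlations rank as high as strongly positive ones; whether that matches the scores in the listing above is an assumption of this sketch.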
Observation 3: For a duration of more than 30000, we can say that the attack type is probe, so duration is a strong predictor.

3. Evaluator: weka.attributeSelection.ClassifierAttributeEval -execution-slots 1 -B weka.classifiers.rules.ZeroR -F 5 -T 0.01 -R 1 -E DEFAULT --
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: X_Sample
Instances: 1282
Attributes: 42