Intelligent Systems for Cybersecurity
MACHINE LEARNING CYBERSECURITY
PHISHING DETECTION
LAB 2: WRITING A CLASSIFIER FOR PHISHING DATASET
Lab Description: In this lab you will write Python scripts and use WEKA
to implement a binary classifier that predicts whether a website is a
phishing website. The dataset contains 102816 web hits, with 30 features
recorded for each hit. A class value is also given for each record.
Example of phishing dataset:
Features Description:
Pr. Meryeme Ayache Page | 2
You are required to implement the classifier in three ways:
Using the machine learning software WEKA.
Writing a Python script that uses the sklearn package.
Writing a Python script that uses the tensorflow package and
deep learning techniques.
Lab Environment: You need access to a machine running either Linux or
Windows. A Python environment is required, along with packages such as
numpy, tensorflow and sklearn.
Lab Files that are Needed: For this lab you will need one file
(phishing_1.csv). The last column is the class value; the other columns
are the features.
LAB EXERCISE 1
Import the data into WEKA (Explorer) and specify the file type (csv).
Choose a suitable classifier, such as RandomForest.
Specify the test option and the class column.
LAB EXERCISE 2
In this exercise, you will implement several classifiers using sklearn.
Import sklearn and the other required libraries.
Read the features and class values from the phishing dataset with a
suitable method.
phishing_1.csv is the name of the file.
delimiter indicates the character that separates the values in a row.
usecols indicates which columns will be read. For the features,
columns 1 to 30 are read. For the class values, the last column of each
row is read, since the last column of the file holds the class value.
dtype indicates the type of the data to read.
Since the first line of the file contains the column names,
we set skip_header to 1 to avoid reading it as data.
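The parameters described above correspond to arguments of numpy.genfromtxt. The sketch below is one possible way to read the file, not the lab's reference solution. It assumes that column 0 is a row index, columns 1 to 30 are the features and the last column is the class value. The helper name load_phishing and the three-row in-memory sample are hypothetical and only serve to make the snippet runnable without the real file.

```python
import numpy as np


def load_phishing(source):
    """Return (features, labels) from a CSV path or a list of CSV lines."""
    features = np.genfromtxt(
        source,
        delimiter=",",               # character that separates values in a row
        usecols=list(range(1, 31)),  # assumed layout: columns 1..30 are features
        dtype=np.int32,              # type of the values that are read
        skip_header=1,               # skip the row of column names
    )
    labels = np.genfromtxt(
        source,
        delimiter=",",
        usecols=-1,                  # assumed layout: last column is the class
        dtype=np.int32,
        skip_header=1,
    )
    return features, labels


# Hypothetical stand-in for phishing_1.csv: index, 30 features, class value.
header = ",".join(["index"] + [f"f{i}" for i in range(1, 31)] + ["Result"])
rows = [
    ",".join(["0"] + ["1"] * 30 + ["-1"]),
    ",".join(["1"] + ["-1"] * 30 + ["1"]),
    ",".join(["2"] + ["0"] * 30 + ["1"]),
]
sample = [header] + rows

X, y = load_phishing(sample)
print(X.shape, y.shape)
```

With the real data, the call would be load_phishing("phishing_1.csv"), provided the file has the column layout assumed here.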
Split the dataset. Once the preprocessing step is finished, you can write
the Python script that uses the sklearn package to build your
classifier.
random_state is the seed used by the random number generator. Fixing it
makes the split reproducible from one run to the next.
This is for the decision tree:
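As a rough illustration of the split and of a decision tree classifier, the sketch below uses train_test_split and DecisionTreeClassifier from sklearn. The data is synthetic (random values in {-1, 0, 1} with a made-up labelling rule) and only stands in for the features and class values read from phishing_1.csv. The 80/20 split and the seed value 42 are assumptions, not values given by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phishing features and class values (assumption).
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(200, 30))      # 200 rows, 30 features in {-1, 0, 1}
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # made-up rule that defines the class

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# random_state also fixes the tie-breaking inside the tree.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Running the script twice with the same random_state produces the same split, which makes results comparable between classifiers.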
Print evaluation metrics such as accuracy, recall, precision and
F1 score.
Implement classifiers based on Logistic Regression, Decision Tree,
Naïve Bayes and Random Forest.
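The four classifiers and the four metrics can be handled in one loop, as sketched below. This is a minimal example on synthetic data, not the expected solution. GaussianNB is an assumed choice for Naïve Bayes (the lab does not name a variant), and the label 1 is treated as the positive class for precision, recall and F1, which is also an assumption about the dataset's encoding.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phishing data (assumption).
rng = np.random.default_rng(1)
X = rng.integers(-1, 2, size=(400, 30))
y = np.where(X[:, :5].sum(axis=1) > 0, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),  # assumed variant
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # pos_label=1 is an assumption about which class counts as positive.
        "recall": recall_score(y_test, y_pred, pos_label=1, zero_division=0),
        "precision": precision_score(y_test, y_pred, pos_label=1, zero_division=0),
        "f1": f1_score(y_test, y_pred, pos_label=1, zero_division=0),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Keeping the split fixed across all four models means differences in the printed metrics come from the classifiers rather than from the data partition.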
LAB EXERCISE 3
Use the same data as in Exercises 1 and 2.
In this exercise, you will implement an artificial neural network classifier
based on TensorFlow.
Import the required libraries.
Repeat the preprocessing steps from Exercise 2. Read the data,
standard-scale the features and encode the labels.
Define the learning rate and the number of epochs for the artificial
neural network.
An extra preprocessing step is to one-hot encode the labels.
Split the dataset after preprocessing and define the parameters that
store the shapes of the placeholders.
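The preprocessing steps for this exercise might look like the sketch below, which scales the features, encodes the labels, one-hot encodes them and then splits the data. The data is synthetic, and the values of learning_rate and n_epochs are placeholders chosen for illustration, not values prescribed by the lab. The names n_features and n_classes are hypothetical names for the shape parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for the phishing data (assumption).
rng = np.random.default_rng(2)
X = rng.integers(-1, 2, size=(100, 30)).astype(np.float32)
y = rng.choice([-1, 1], size=100)

# Hyperparameters: illustrative values only.
learning_rate = 0.01
n_epochs = 50

# Standard scaling: each feature gets zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Label encoding: map the class values {-1, 1} to the integers {0, 1}.
y_encoded = LabelEncoder().fit_transform(y)

# One-hot encoding: row i of the identity matrix represents class i.
y_onehot = np.eye(2, dtype=np.float32)[y_encoded]

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_onehot, test_size=0.2, random_state=42
)

# Shape parameters used later to size the network's input and output.
n_features = X_train.shape[1]
n_classes = y_train.shape[1]
print(n_features, n_classes)
```

This follows the order given in the lab (scale, then split). A common alternative is to fit the scaler on the training portion only and apply it to the test portion afterwards.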
Define the function that plots the performance.
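One way to write such a plotting function is sketched below with matplotlib, which is an assumed choice since the lab does not name a plotting library. The function name plot_performance, its parameters and the demo values are hypothetical. The Agg backend is selected so that the script also runs on a machine without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, writes files only
import matplotlib.pyplot as plt


def plot_performance(train_values, test_values, metric_name, out_path):
    """Plot one metric per epoch for the training and test sets."""
    epochs = range(1, len(train_values) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, train_values, label="train")
    ax.plot(epochs, test_values, label="test")
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric_name)
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)  # free the figure once it is saved
    return out_path


# Made-up values, only to show how the function is called.
demo_path = plot_performance(
    [0.60, 0.80, 0.90],
    [0.55, 0.75, 0.85],
    "accuracy",
    os.path.join(tempfile.gettempdir(), "performance_demo.png"),
)
print("saved plot to", demo_path)
```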
Define your own neural network architecture.
Print evaluation metrics such as accuracy, recall, precision and
F1 score.
Initialize the variables and placeholders. Then perform
training and testing on the dataset.
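The sketch below shows one possible small network together with training, testing and the four metrics. It uses the tf.keras API of TensorFlow 2, where no placeholders or explicit variable initialization are needed. This is a swapped-in style, not the placeholder-based approach the steps above describe. The layer sizes, learning rate, number of epochs and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Synthetic stand-in for the preprocessed phishing data (assumption).
rng = np.random.default_rng(3)
X = rng.integers(-1, 2, size=(300, 30)).astype(np.float32)
labels = (X[:, :5].sum(axis=1) > 0).astype(np.int64)
y = np.eye(2, dtype=np.float32)[labels]          # one-hot labels
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

learning_rate = 0.01   # illustrative value
n_epochs = 20          # illustrative value

tf.random.set_seed(0)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(30,)),                     # 30 input features
    tf.keras.layers.Dense(16, activation="relu"),    # one hidden layer
    tf.keras.layers.Dense(2, activation="softmax"),  # one output per class
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Training; history.history holds per-epoch loss and accuracy for plotting.
history = model.fit(
    X_train, y_train,
    epochs=n_epochs, batch_size=32,
    validation_data=(X_test, y_test),
    verbose=0,
)

# Testing: convert softmax outputs and one-hot labels back to class indices.
y_pred = model.predict(X_test, verbose=0).argmax(axis=1)
y_true = y_test.argmax(axis=1)
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred, zero_division=0),
    "precision": precision_score(y_true, y_pred, zero_division=0),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}
print({k: round(v, 3) for k, v in metrics.items()})
```

The output layer has two units because the labels are one-hot encoded. With integer labels, a single sigmoid unit and a binary cross-entropy loss would be an equivalent design.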
WHAT TO SUBMIT
You should submit a lab report file that includes the steps you used to
preprocess the data and the necessary code snippets for your classifier and
architecture. Screenshots of both your code snippets and the results
are also required. You can name your file "Lab2_phishing_yourname.doc".