Final Doc Fin PDF
Submitted by
NITHISH. B (312820104057)
PRADEEP JOSHWA (312820104060)
PRAKASH RAJA. C (312820104061)
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
MAY 2024
ABSTRACT
These efforts are not merely technical enhancements; they are crucial for mitigating
the risk of data breaches and identity theft that loom over unsuspecting users. As
phishing attacks become more intricate and insidious, a proactive and adaptive
response is imperative to safeguard the digital ecosystem. Beyond the immediate
goal of threat mitigation, these optimizations contribute significantly to the broader
objective of fostering user privacy, building trust, and preserving data integrity across diverse
online platforms. In essence, this project represents a critical stride towards
fortifying the digital realm against the ever-evolving and persistent menace of
phishing attacks.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 OVERVIEW
1.2 OBJECTIVE
1.3 DESCRIPTION
2 LITERATURE SURVEY
3.4 METHODOLOGIES
3.4.5 XGBOOST MODEL
4 DESIGN PROCESS
5 IMPLEMENTATIONS
6 EXPERIMENTATION RESULTS
6.1 OBSERVATIONS
6.2 INFERENCES
REFERENCES
APPENDIX
A1 – SOURCE CODE
A2 – SCREENSHOTS
LIST OF FIGURES
FIGURE NO. FIGURE NAME PAGE NO.
LIST OF ABBREVIATIONS
ML Machine Learning
AI Artificial Intelligence
RF Random Forest
LR Logistic Regression
DT Decision Tree
NB Naive Bayes
MITM Man-in-the-Middle
IP Internet Protocol
MLP Multilayer Perceptron
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Data science and machine learning are closely intertwined disciplines, often used in
conjunction to extract valuable insights from data, make predictions, and automate
decision-making processes. Let's delve into how they are integrated, particularly in
the context of classification tasks.
1.1.2 MACHINE LEARNING
Data Preparation: Data scientists play a crucial role in preparing the data for
classification tasks. They handle tasks such as cleaning noisy data, handling missing
values, and transforming data into a suitable format for machine learning algorithms.
Feature Engineering: Identifying and selecting relevant features (variables) from the
dataset is a critical step in classification. Data scientists use domain knowledge to
determine which features are most informative for the task at hand.
Training the Model: Machine learning models are trained on labelled datasets
during this phase. The model learns patterns and relationships between features and
labels, adjusting its parameters to make accurate predictions.
Evaluation and Validation: Data scientists are responsible for evaluating the
performance of the trained model using validation datasets. They use metrics such
as accuracy, precision, recall, and F1-score to assess how well the model
generalises to new, unseen data.
1.2 OBJECTIVE
The objective of this project is to develop and compare two machine learning models
for the task of detecting phishing websites. The primary focus is on enhancing the
accuracy and efficiency of phishing detection methods by using feature selection,
and also on evaluating the models based on their accuracy in order to determine
which one performs better for the given task.
Additionally, the project considers integrating this ML-based system into web browsers,
or offering it as a browser extension, to provide real-time warnings and protection for users
while browsing, ultimately fostering a safer and more secure online environment.
1.3 DESCRIPTION
Chapter 1 lays the groundwork for the entire project report. It provides an overview of
the project, including its purpose, scope, and significance. It also describes the
project in more detail, outlining the objectives and expected outcomes. Additionally, it
explains the structure of the report itself, giving the reader a roadmap to navigate the
information presented.
Chapter 3 tackles the heart of the project by defining the problem and outlining the
chosen approach to solve it. This chapter dives into the existing system, analyses its
limitations, and presents the proposed solution with its underlying algorithm. In
essence, it's the roadmap for tackling the challenge at hand.
Chapter 4 lays out the blueprint for a website with a Chrome extension, from its
purpose and audience to its technical structure and development process. It starts
with an overview, then specifies requirements, details the architecture, and breaks
down the design step-by-step. Each module gets its own dedicated explanation,
while the conclusion wraps everything up and suggests potential future directions.
Essentially, this chapter serves as a comprehensive roadmap for bringing the
website and its extension to life.
Chapter 5 chronicles the critical steps taken to construct and activate a powerful
phishing URL detection website. It showcases the development and deployment of
its core security features. It meticulously details the construction of these vital
safeguards, providing a comprehensive blueprint for those seeking to establish their
own phishing URL detection stronghold.
Chapter 6 discusses the significant lessons learned, and the necessary future
enhancements that can be used to improve the overall feasibility of the proposed
work.
CHAPTER 2
LITERATURE SURVEY
Lakshmana Rao Kalabarige et al., 2022 [1] propose a highly effective Multilayer
Stacked Ensemble Learning Model for phishing detection, achieving 96.79% to
98.90% accuracy. Outperforming baselines, the model underscores its efficacy with
improved metrics. The paper stresses the urgency of countering phishing, outlines
the model's architecture and results, and suggests future research on feature
selection and model optimization. Overall, the study introduces a potent detection
model, validates its effectiveness, and outlines avenues for further research.
Rasha Zieni et al., 2023 [4] review phishing detection methods, emphasising model
explainability, feature engineering, and domain knowledge. They identify gaps,
including URL shortening challenges, and stress the importance of reproducibility,
diverse datasets, and informed feature selection. The survey offers insights into
evolving phishing tactics, highlighting the need for continuous research and user
education in effective countermeasures.
Al-Sarem et al., 2021 [5] presented an optimised stacking ensemble method for
phishing detection, employing Genetic Algorithm to determine optimal parameters.
The ensemble comprised algorithms like Random Forests, AdaBoost, XGBoost,
Bagging, GradientBoost, and LightGBM, applied to UCI Phishing, Mendeley, and
Mendeley-small variant datasets. The model demonstrated remarkable accuracy of
97.16%, 98.58%, and 97.35% on the respective datasets, showcasing its
effectiveness across diverse phishing instances and features.
Yi Wei et al., 2022 [6] compare machine learning and deep
learning methods for phishing website classification. They assess traditional
algorithms, ensemble methods, and deep learning models like Random Forest,
AdaBoost, LSTM, CNN, and RNN. Results emphasise ensemble methods'
effectiveness, particularly Random Forest, in achieving high accuracy and
computational efficiency, especially with reduced feature sets.
Ahmet Ozaday et al., 2022 [7] used six machine learning algorithms to classify URLs
based on eleven features, with Random Forest yielding the highest accuracy of
98.90%. Comparing various methods, they concluded Random Forest provided
consistent and superior performance. The study stressed the importance of updated
datasets, global collaboration, and user awareness in combating phishing.
A. Karim et al., 2023 [8] developed a phishing detection system employing various
machine learning models and a hybrid approach (LR+SVC+DT). The hybrid model
demonstrated high efficiency, utilising metrics like accuracy, precision, recall,
specificity, and F1-score. The study underscores the effectiveness of combining
list-based and machine-learning-based systems for more efficient phishing URL
detection.
M. Aljabri et al., 2022 [9] offer a thorough review of ML algorithms for detecting
malicious URLs, highlighting SVM, RF, DT, NB, and LR with accuracy surpassing
98.42%. The review underscores the effectiveness of ensemble techniques in
achieving over 90% accuracy and discusses challenges like sample sizes and
network traffic considerations. Providing insights into datasets, features, and model
accuracy, the study contributes to understanding and addressing unresolved issues
in malicious URL detection.
Priscilla Kyei Danso et al., 2022 [10] tackle IoT security challenges with an
Intelligent Ensemble-based IDS at the gateway, mitigating threats like MITM and
DoS. The proposed solution employs Naïve Bayes, SVM, and k-NN as base
learners, demonstrating efficacy through ensemble models on various datasets. The
study emphasises the importance of ensemble learning in IoT security and suggests
future directions for anomaly-based IDS improvements.
Pankaj Saraswat et al., 2022 [11] address email security challenges, focusing on
phishing detection with machine learning. Using SVM and Random Forest, the study
achieves a maximum accuracy of 96.87%, emphasising the need for effective
detection methods against evolving phishing techniques. The proposed system
extracts link, tag, and word-based features, underscoring the importance of dataset
expansion for real-world applicability.
CHAPTER 3
In this chapter, the project delves into the core by identifying the problem and
articulating the selected strategy for resolution. The chapter critically examines the
current system, putting forth its constraints, and introduces the solution, complete
with its underlying algorithm. Essentially, this section serves as a comprehensive
guide, outlining the path forward for addressing the project's central challenge.
phase with the Phishing dataset, data balancing phase, and the implementation
phase for executing the model effectively.
3.4 METHODOLOGIES
such as URL length, presence of HTTPS, use of IP addresses, presence of
suspicious keywords, and other relevant indicators are likely extracted. These
features provide valuable information for the machine learning models to learn and
make predictions effectively.
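The indicators listed above can be illustrated with a short sketch. It is not the project's extractor: the function name `lexical_features`, the keyword list, and the returned field names are assumptions made for illustration, and only Python's standard library is used.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure", "account")


def lexical_features(url: str) -> dict:
    """Return a few simple URL-based indicators (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # An IP literal used in place of a domain name is a common phishing indicator.
    try:
        ipaddress.ip_address(host)
        uses_ip = True
    except ValueError:
        uses_ip = False

    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "uses_ip_address": uses_ip,
        # Number of distinct keywords that occur anywhere in the URL.
        "keyword_hits": sum(word in lowered for word in SUSPICIOUS_KEYWORDS),
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.10.5/secure-login/verify.php"))
```

Each value is a raw indicator; how such indicators are encoded and weighted is left to the models described in this chapter.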
3.4.6 Metric Selection:
Evaluation metrics are used to assess the performance of the machine learning
models. Common metrics used in binary classification tasks like phishing detection
include precision, recall, F1 score, and accuracy.
• Precision measures the proportion of true phishing URLs among the URLs
predicted as phishing.
• Recall measures the proportion of true phishing URLs correctly identified by
the model.
• F1 score is the harmonic mean of precision and recall.
• Accuracy measures the overall correctness of the model's
predictions.
By using relevant evaluation metrics, the system can effectively evaluate and
compare the performance of the multilayer stacked ensemble model and the
XGBoost model to select the best-performing approach.
CHAPTER 4
DESIGN PROCESS
The proposed system offers a comprehensive solution for phishing URL detection,
with a dedicated website, and a convenient Chrome extension to strengthen user
protection while browsing the web. The dedicated website is the core of the system,
with an easy-to-use interface where users can input URLs to be analyzed. The
backend integrates machine learning models such as the multilayer stacked
ensemble model and the XGBoost model. The inclusion of Fisher’s score in the
feature selection methodology improves the system’s ability to identify phishing
patterns by focusing on critical aspects that are not explicitly covered by the current
system.
providing timely warnings and alert users so they can make informed decisions while
navigating the web.
The system’s design focuses on the smooth integration of the dedicated website with
the Chrome extension to provide a holistic approach to the detection of phishing
URLs. Sophisticated communication channels ensure the secure exchange of data
between the website’s backend and Chrome extension, while preserving the user’s
privacy and the system’s reliability. Regular updates and syncing mechanisms
ensure that the machine learning model and detection algorithms are always
up-to-date and effective against ever-evolving phishing tactics.
The user experience is at the forefront, with an intuitive website interface and an
unobtrusive Chrome extension, allowing users to conveniently access protection
features without interrupting their browsing. Real-time notifications from the
extension enable users to make smart decisions about the security of the URLs they
encounter. In conclusion, the proposed solution combines the best features of a
dedicated site with a Chrome extension to increase the effectiveness of phishing
URL detection, while prioritising easy-to-use interactions and real-time feedback to
actively protect users during their online engagements.
FIGURE 4.1 Data Flow Diagram
The key components of the phishing detection system include the user, who interacts
with the system through a Chrome Extension integrated with the web browser. The
user begins by registering and logging into the extension or a corresponding service.
When the user interacts with a website, the extension performs URL analysis: the
user can either paste a URL manually or let the extension scan links automatically
when the cursor hovers over them. The core of the system is the Phishing Detection
System, which scrutinises the provided URL by comparing it against a database of
known phishing URLs, patterns, and characteristics. After determining whether the
website is likely a phishing attempt or legitimate, the system issues phishing alerts to
the user, potentially blocking access or displaying warnings if suspicious activity is
detected. This process aims to enhance online security by
proactively identifying and notifying users about potential phishing threats.
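The flow described above (look the URL up among known phishing URLs, otherwise score it, then block, warn, or allow) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `check_url`, the two thresholds, and the `score_fn` callback standing in for the trained model are all hypothetical.

```python
from typing import Callable, Iterable


def check_url(
    url: str,
    known_phishing: Iterable[str],
    score_fn: Callable[[str], float],
    block_at: float = 0.9,  # hypothetical threshold
    warn_at: float = 0.5,   # hypothetical threshold
) -> str:
    """Return 'block', 'warn', or 'allow' for a URL."""
    # Light normalisation so a trailing slash does not defeat the lookup.
    normalized = url.strip().rstrip("/")
    blocklist = {entry.strip().rstrip("/") for entry in known_phishing}

    # Step 1: exact match against known phishing URLs.
    if normalized in blocklist:
        return "block"

    # Step 2: fall back to a model score in [0, 1] (higher = more suspicious).
    score = score_fn(url)
    if score >= block_at:
        return "block"
    if score >= warn_at:
        return "warn"
    return "allow"


if __name__ == "__main__":
    known = ["http://bad.example/login"]
    # A constant score stands in for the trained model here.
    print(check_url("http://bad.example/login/", known, lambda u: 0.0))
    print(check_url("http://shop.example", known, lambda u: 0.6))
```

Separating the lookup from the scoring step keeps the cheap check first, so the model is consulted only for URLs that are not already known.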
At the core of the system lies the Phishing Detection Model, serving as the
intelligence hub. This model is likely a machine learning model meticulously trained
to discern patterns and features characteristic of phishing websites. It plays a pivotal
role in the system's functionality, leveraging its learned knowledge to analyse
incoming data and determine whether a visited webpage poses a potential phishing
threat. In essence, the web application operates as a seamlessly integrated unit, with
the frontend facilitating user interaction through the Chrome Extension, while the
backend, with its content script, background script, API, and the Phishing Detection
Model, collectively ensures robust processing and accurate identification of potential
phishing websites.
Figure 4.3 Backend-based architecture diagram
4.4 PROJECT REQUIREMENTS
Google Colab is a cloud-based platform provided by Google that allows for the
creation and execution of Jupyter notebooks in a collaborative environment. It
provides free access to GPU resources, which can be beneficial for training machine
learning models. Google Colab is specified as a tool, indicating that the project may
leverage its cloud-based infrastructure for resource-intensive tasks, such as training
machine learning models on the specified dataset.
In summary, the specified software and hardware requirements are optimised to be
used with Windows as the required OS, with Jupyter and Google Colab as key
development tools. The hardware specifications include a high-performance
processor, SSD for fast storage, ample RAM for efficient multitasking, and a
dedicated GPU for accelerated machine learning tasks. These choices are aligned
with the computational demands of developing and implementing a phishing URL
detection system with machine learning models.
CHAPTER 5
IMPLEMENTATIONS
Chapter 5 chronicles the critical steps taken to construct and activate a powerful
phishing URL detection website. It showcases the development and deployment of
its core security features. It meticulously details the construction of these vital
safeguards, providing a comprehensive blueprint for those seeking to establish their
own phishing URL detection stronghold.
#Listing the features of the dataset
data.columns
# nunique value in columns
data.nunique()
#description of dataset
data.describe().T
Figure 5.6 Description of the data
5.2 DATA VISUALIZATION
#Correlation heatmap
plt.figure(figsize=(15,15))
sns.heatmap(data.corr(), annot=True)
plt.show()
This code uses seaborn (sns) and matplotlib (plt) to draw a heatmap of the
correlation matrix of the DataFrame data, that is, the pairwise correlations between
its columns. With seaborn's default colormap, lighter cells correspond to higher
(more positive) coefficients and darker cells to lower (more negative) ones, so strong
negative correlations appear dark rather than bright. Setting annot=True prints the
exact correlation coefficient for each pair of columns in its cell.
#Phishing Count in a pie chart
data['class'].value_counts().plot(kind='pie', autopct='%1.2f%%')
plt.title("Phishing Count")
plt.show()
This code draws a pie chart of the distribution of the 'class' column in the
DataFrame 'data'. Each slice represents one class, and its size corresponds to the
frequency of that class in the dataset. The percentage printed on each slice,
formatted to two decimal places, is the share of that class among all instances in
the dataset.
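The percentages shown on the chart can also be reproduced numerically, which is a quick way to check for class imbalance. The sketch below is illustrative only: the helper name `class_percentages` and the toy labels (1 and -1) are assumptions, not the project's data.

```python
from collections import Counter


def class_percentages(labels):
    """Share of each class in percent, rounded to two decimals like the chart."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * count / total, 2) for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical labels: three instances of one class, one of the other.
    print(class_percentages([1, 1, 1, -1]))
```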
# Splitting the dataset into train and test sets: 80-20 split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
Figure 5.9 Testing and Training Data
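import numpy as np

# NOTE: the function signature, the opening of the docstring, and the lines
# marked "reconstructed" are missing from the listing; they are inferred from
# the surviving fragments and from the call fisher_score(X, y) below.
def fisher_score(X, y):
    """
    Compute the Fisher Score of each feature.

    Parameters:
    - X: numpy array, shape (n_samples, n_features)
      Feature matrix.
    - y: numpy array, shape (n_samples,)
      Target vector.

    Returns:
    - scores: numpy array, shape (n_features,)
      Fisher Scores for each feature.
    """
    # Number of samples for each class
    classes, class_counts = np.unique(y, return_counts=True)
    n_classes = len(classes)

    # Reconstructed: overall feature means and scatter accumulators
    mean_overall = np.mean(X, axis=0)
    S_W = np.zeros(X.shape[1])  # within-class scatter per feature
    S_B = np.zeros(X.shape[1])  # between-class scatter per feature

    for i in range(n_classes):
        X_class = X[y == classes[i]]  # reconstructed: samples of class i
        mean_class = np.mean(X_class, axis=0)
        S_W += ((X_class - mean_class)**2).sum(axis=0)
        S_B += class_counts[i] * ((mean_class - mean_overall)**2)

    # Reconstructed: ratio of between-class to within-class scatter
    return S_B / S_W

# Example usage
# Assuming X is your feature matrix and y is your target vector
fisher_scores = fisher_score(X, y)

# If you want to rank features based on Fisher Score
ranking = np.argsort(fisher_scores)[::-1]
print("Features ranked by Fisher Score:")
for rank in ranking:
    print(f"Feature {rank} Score: {fisher_scores[rank]}")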
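For reference, the quantities accumulated in the listing above correspond to the following per-feature expression. It assumes the function returns the ratio S_B / S_W, which is the usual definition of the Fisher Score. The symbols are introduced here for illustration: C is the number of classes, n_c the number of samples in class c, \mu_{c,j} the mean of feature j within class c, \mu_j its overall mean, and x_{i,j} the value of feature j for sample i.

```latex
F_j \;=\; \frac{S_B^{(j)}}{S_W^{(j)}}
    \;=\; \frac{\displaystyle\sum_{c=1}^{C} n_c \,\bigl(\mu_{c,j} - \mu_j\bigr)^2}
               {\displaystyle\sum_{c=1}^{C} \sum_{i \in c} \bigl(x_{i,j} - \mu_{c,j}\bigr)^2}
```

A feature scores highly when its class means lie far apart (large numerator) while its values stay tightly grouped within each class (small denominator).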
#Fitting the models
def fit_models(models, X, y):
    for model in models:
        model.fit(X, y)

[
    MLPClassifier(max_iter=1000, random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
]
# Train second layer models using the meta-features from the first layer
fit_models(models_layer2, X_train_meta_1, y_train)
# Generate second layer meta-features
X_train_meta_2 = generate_meta_features(models_layer2, X_train_meta_1)
X_test_meta_2 = generate_meta_features(models_layer2, X_test_meta_1)
#Third layer
final_layer_model = xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
# Third layer training and final predictions
final_layer_model.fit(X_train_meta_2, y_train)
y_pred_final = final_layer_model.predict(X_test_meta_2)
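The listing calls `generate_meta_features` without showing its definition. The sketch below is one plausible shape for such a helper, not the authors' code: it assumes each fitted model exposes a scikit-learn style `predict_proba`, and it uses plain lists and a toy stand-in class (`FixedProbaModel`, hypothetical) so that it runs without any dependencies.

```python
def generate_meta_features(models, X):
    """Concatenate every model's class probabilities, row by row.

    Hypothetical reconstruction: the helper in the listing may instead use
    hard predictions or a different layout.
    """
    per_model = [model.predict_proba(X) for model in models]
    meta = []
    for row_idx in range(len(X)):
        row = []
        for probs in per_model:
            row.extend(float(p) for p in probs[row_idx])
        meta.append(row)
    return meta


class FixedProbaModel:
    """Toy stand-in for a fitted classifier (illustration only)."""

    def __init__(self, p_phishing):
        self.p_phishing = p_phishing

    def predict_proba(self, X):
        # Same [P(legitimate), P(phishing)] pair for every row.
        return [[1.0 - self.p_phishing, self.p_phishing] for _ in X]


if __name__ == "__main__":
    layer = [FixedProbaModel(0.75), FixedProbaModel(0.25)]
    print(generate_meta_features(layer, [[0], [1]]))
```

With two binary classifiers each input row becomes four meta-features, which is what the next layer trains on. A common refinement in stacking is to build the training meta-features from out-of-fold predictions, so that a layer is not fed predictions its models made on data they were fitted to.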
Figure 5.12 Confusion matrix for the model
CHAPTER 6
EXPERIMENTATION RESULTS
Chapter 6 serves as a reflection on the essential lessons derived from our research
journey, highlighting the quest for improvement and innovation in ensemble learning
methodologies.
6.1 OBSERVATIONS
This chapter delves into the comparison of two powerful machine learning
algorithms, Multilayer Stacked Ensemble Learning Machine (MLSELM) and
XGBoost, for the task of phishing website detection. The MLSELM model,
comprising three layers of classifiers, outperformed the XGBoost model in terms of
accuracy. Through meticulous feature selection, the MLSELM achieved an
impressive accuracy of 97%, while the XGBoost model attained 86% accuracy.
The counts of True Positives (NTP), True Negatives (NTN), False Positives (NFP),
and False Negatives (NFN) are defined as follows:
- P: Total number of phishing instances
- N: Total number of legitimate instances
- NTN: Number of legitimate instances predicted as legitimate
- NFN: Number of phishing instances predicted as legitimate
- NTP: Number of phishing instances predicted as phishing
- NFP: Number of legitimate instances predicted as phishing
The computation of each metric is articulated as follows:
• Accuracy: Accuracy is the proportion of correctly classified cases (true
positives and true negatives) out of the total number of cases examined.
((NTP + NTN) / (P + N)) × 100
• Precision: Precision is the proportion of true positives out of the total number
of cases predicted as positive.
(NTP / (NTP + NFP)) × 100
• Recall: Recall is the proportion of true positives out of the total number of
positive cases in the dataset.
(NTP / (NTP + NFN)) × 100
• F-score: Combines precision and recall into a single metric, their harmonic
mean (with Precision and Recall already expressed as percentages).
2 × (Precision × Recall) / (Precision + Recall)
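The metric definitions can be checked with a small worked example. The sketch below is illustrative: the function name `phishing_metrics` and the counts are hypothetical and are not the project's results. The F-score is computed as the harmonic mean of precision and recall, which includes a factor of 2.

```python
def phishing_metrics(n_tp, n_tn, n_fp, n_fn):
    """Accuracy, precision, recall and F-score in percent from the four counts."""
    p = n_tp + n_fn  # P: total number of phishing instances
    n = n_tn + n_fp  # N: total number of legitimate instances
    accuracy = (n_tp + n_tn) / (p + n) * 100
    precision = n_tp / (n_tp + n_fp) * 100
    recall = n_tp / (n_tp + n_fn) * 100
    # Harmonic mean; precision and recall are already percentages.
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Hypothetical counts for 100 phishing and 100 legitimate URLs.
    for name, value in phishing_metrics(n_tp=90, n_tn=80, n_fp=20, n_fn=10).items():
        print(f"{name}: {value:.2f}%")
```

The sketch does not guard against empty denominators (for example, a model that never predicts phishing), which a production version would need to handle.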
The experimental setup involved training and evaluating the MLSELM and XGBoost
models on the same dataset of features relevant to phishing website detection, so
that both are assessed under identical conditions. Both models underwent feature
selection using Fisher’s Score, which also gives insight into the impact of this
preprocessing step on model performance. This comparative analysis highlights the
relative strengths and weaknesses of each approach, aiding in the selection of the
most suitable algorithm for phishing website detection tasks.
It is important to consider that the performance of these models may vary depending
on the specific dataset they are trained on and the types of phishing websites they
encounter.
6.2 INFERENCES
The superior performance of the MLSELM model can be attributed to its multilayer
stacked ensemble architecture, which leverages the collective intelligence of multiple
classifiers to make accurate predictions. By incorporating diverse base classifiers
and meta-learning techniques, MLSELM effectively captures the complex
relationships between features and the target variable, enhancing its discriminative
power. In contrast, while XGBoost is renowned for its scalability and efficiency, its
performance may be limited by its single-layer ensemble approach, which may
struggle to capture intricate patterns in the data.
CHAPTER 7
In envisioning the future enhancements for our phishing website detection model, we
are poised to revolutionize cybersecurity by imbuing it with self-learning and
self-updating capabilities. By leveraging advanced machine learning algorithms and
innovative techniques, our model will autonomously adapt to emerging threats,
continuously refining its predictive capabilities without the need for manual
intervention. This transformative approach not only ensures real-time protection but
also alleviates the burden on administrators, freeing them from the tedious task of
manual updates.
REFERENCES
[2] Rasha Zieni , Luisa Masari , and Maria Carla Calzarossa – (2023) "Phishing or
Not Phishing? A Survey on the Detection of Phishing Websites"
[3] Yazan A. Al-Sarier, Victor Elijah Adeyemo, Abdullateef O. Balogun and Ammar
K. Alazzawi – (2020) "AI Meta-Learners and Extra-Trees Algorithm for the
Detection of Phishing Websites"
[8] A. Nazir et al., (2017) "Machine Learning-Based Phishing Detection Using URL
and Website Content Features," in Computers & Security, vol. 68, pp. 126-140,
doi: 10.1016/j.cose.2017.04.003.
[10] A. Kumar et al., (2015) "A Review of Machine Learning Approaches to Phishing
Detection," in Procedia Computer Science, vol. 48, pp. 96-104, doi:
10.1016/j.procs.2015.04.197.
[11] L. Liao et al., (2018) "Combating Phishing Using Trusted Features and Machine
Learning," in Information Sciences, vol. 423, pp. 85-102, doi:
10.1016/j.ins.2017.10.005.
[12] H. Y. Son et al., (2019) "Phishing Website Detection Using Machine Learning
and Features Extracted from Website Images," in Journal of Information
Processing Systems, vol. 15, no. 1, pp. 117-133, doi: 10.3745/JIPS.03.0104.
[15] C. Singh et al., (2020) "A Machine Learning Approach for Detecting Phishing
Websites Using Neural Network," in Journal of King Saud University -
Computer and Information Sciences, doi: 10.1016/j.jksuci.2020.07.001.
[17] G. Li et al., (2019) "Phishing Website Detection Based on URL Features Using
Machine Learning," in IEEE Access, vol. 7, pp. 131577-131588, doi:
10.1109/ACCESS.2019.2936143.
[18] S. A. Alqahtani et al., (2018) "A Novel Approach for Phishing Detection Based
on Ensemble Learning," in International Journal of Advanced Computer
Science and Applications, vol. 9, no. 10, pp. 308-316, doi:
10.14569/IJACSA.2018.091044.
[24] How Hackers do Phishing Attacks to hack your accounts - YouTube Video by
Tech Raj: https://www.youtube.com/watch?v=RNzMKEYi2_0
[25] Feature Selection Techniques Easily Explained - YouTube Video by Krish Naik:
https://www.youtube.com/watch?v=EqLBAmtKMnQ
[26] What is a Phishing Attack? – Article by
IBM: https://www.ibm.com/topics/phishing
[27] How to Recognize and Avoid Phishing Scams – Article by Federal Trade
Commission: https://consumer.ftc.gov/articles/how-recognize-and-avoid-phishing-scams
[28] Phishing: Spot and report scam emails, texts, websites and calls – Article by
National Cyber Security Centre: https://www.ncsc.gov.uk/collection/phishing-scams
[29] Multi-layer stacking ensemble learners for low footprint network intrusion
detection – Article by Springer Link:
https://link.springer.com/article/10.1007/s40747-022-00809-3
APPENDIX
A1 - SOURCE CODE
Base.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- Required Js Scripts -->
{% block scripts %}
<script src="{{ url_for('static', filename='js/jquery.min.js') }}"></script>
<script src="{{ url_for('static', filename='js/bootstrap.bundle.min.js') }}"></script>
<script src="https://cdn.jsdelivr.net/npm/sweetalert2@11"></script>
{% endblock %}
</body>
</html>
Check.html
{% extends 'base.html' %}
{% block title %}Processing | Phishing Website Detector{% endblock %}
{% block content %}
<div class="card card--result color-bg-dark">
<img class="screenshot--target card-img rounded-0" alt="{{target}}" style="display: none;">
<div class="screenshot--skeleton placeholder-glow">
<span class="placeholder"></span>
</div>
{% block scripts %}
{{ super() }}
<script>
function setTargetScreenshot() {
  var targetScreenshot = $(".screenshot--target");
  var skeletonScreenshot = $(".screenshot--skeleton");
  targetScreenshot.hide();
  skeletonScreenshot.show();
var waterLevel = 100 - percentage;
$(".ball-percent").append($("<span>").text(percentage.toLocaleString('en-US', {
minimumFractionDigits: 0, maximumFractionDigits: 1,
}) + "%"));
$(".ball-water").css("top", waterLevel + "%");
$(".content--area").find(".placeholder").removeClass("placeholder");
$(".content--area").removeClass("placeholder-glow");
}
function phishedAlert() {
  Swal.fire({
    title: "Phished Website!!",
    text: "It's too dangerous to continue, hence we can't allow this action.",
    showCancelButton: true,
    showConfirmButton: false,
    showDenyButton: true,
    denyButtonText: 'Back to home'
  }).then((result) => {
    if (result.isDenied) {
      window.location = "{{ url_for('home') }}";
    }
  });
}
function invalidAlert(error) {
  Swal.fire({
    icon: 'error',
    title: 'Oops! Something went wrong!',
    text: error,
    input: 'url',
    showDenyButton: true,
    denyButtonText: 'Back to home',
    confirmButtonText: 'Check again',
    inputPlaceholder: 'Enter the URL',
    allowOutsideClick: false
  }).then((result) => {
    if (result.isConfirmed) {
      window.location = `{{ url_for('check') }}?target=${result.value}`;
    } else if (result.isDenied) {
      window.location = "{{ url_for('home') }}";
    }
  });
}
$(function () {
setTargetScreenshot();
    } else {
      showResult("");
      invalidAlert(data.message);
    }
  },
});
});
$(window).on('resize', function () {
  setTargetScreenshot();
});
</script>
{% endblock %}
Home.html
<!DOCTYPE html>
</head>
<body>
<header>
<a href="/url-detector" class="link">Phishing Url Detector</a>
</header>
<!-- <div class='footer'>
</div> -->
</body>
</html>
Index.html
{% extends 'base.html' %}
{% block title %}Phishing Detector by Invaders{% endblock %}
{% block content %}
<section class="container min-vh-100">
<header>
<a href="/" class="link">Home Page</a>
</header>
</figure>
</div>
<form action="{{ url_for('check') }}" method="get" class="form--home input-group rounded">
<input type="url" id="target-url" name="target" class="form-control form-control-lg text-center border border-dark border-2 py-3" placeholder="http://phish-site.com/malicious-url" required />
{% block scripts %}
{{ super() }}
<script>
var eventSource = new EventSource("{{ url_for('listen') }}");
eventSource.addEventListener(
  "stats",
  function (e) {
    console.log(e.data);
    data = JSON.parse(e.data);
    $("#web-visits").text(data.visits);
    $("#web-checked").text(data.checked);
    $("#web-phished").text(data.phished);
  },
  true
);
</script>
{% endblock %}
Result.html
<!DOCTYPE html>
/>
<script
src="https://kit.fontawesome.com/5f3f547070.js"
crossorigin="anonymous"
></script>
<link
href="https://fonts.googleapis.com/css2?family=Roboto&display=swap"
rel="stylesheet"
/>
</head>
<body>
<div class="results">
<h1>PREDICTION RESULT</h1>
{% if prediction==1 %}
<h2>
<span class="danger"
>Caution! Our system has flagged this message as a possible phishing
attempt</span
>
</h2>
<img class="image"
src="{{ url_for('static', filename='unsafe-icon.png') }}"
alt="SPAM Image"
/>
{% elif prediction==0 %}
<h2>
<span class="safe"
>Congratulations! This message is classified as SAFE</span
>
</h2>
<img class="image"
src="{{ url_for('static', filename='safety-icon.png') }}"
alt="Not a spam image"
/>
{% endif %}
</div>
</body>
</html>
Content detector.py
# Imports required by this script (not shown in the listing)
import pickle
from flask import Flask, render_template, request

# Load the Multinomial Naive Bayes model and CountVectorizer object from disk
filename = 'spam-sms-mnb-model.pkl'
classifier = pickle.load(open(filename, 'rb'))
cv = pickle.load(open('cv-transform.pkl', 'rb'))
app = Flask(__name__)

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        # Vectorise the submitted message and classify it
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)

if __name__ == '__main__':
    app.run(debug=True)
Features.py
# Extraction of features from the URL
# 0 having_IP_Address
# 1 URL_Length
# 2 Shortining_Service
# 3 having_At_Symbol
# 4 double_slash_redirecting
# 5 Prefix_Suffix
# 6 having_Sub_Domain
# 7 URL_Depth
# 8 Domain_registeration_length
# 9 Favicon
# 10 port
# 11 HTTPS_token
# 12 Request_URL
# 13 URL_of_Anchor
# 14 Links_in_tags
# 15 SFH
# 16 Submitting_to_email
# 17 Abnormal_URL
# 18 Redirect
# 19 on_mouseover
# 20 RightClick
# 21 popUpWidnow
# 22 Iframe
# 23 age_of_domain
# 24 DNSRecord
# 25 web_traffic
# Each of the feature functions above returns
# 1 if the URL is Phishing,
# -1 if the URL is Legitimate and
# 0 if the URL is Suspicious
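The index list above fixes the order in which feature values are handed to the model. As a hedged sketch of that step, the helper below turns a feature dictionary into a row in a fixed column order. `FEATURE_ORDER` holds only the first four names from the list, and `to_row` is a hypothetical helper, not part of the project code.

```python
# First four names from the index list above; the full list has 26 entries.
FEATURE_ORDER = [
    "having_IP_Address",
    "URL_Length",
    "Shortining_Service",
    "having_At_Symbol",
]


def to_row(features, order=FEATURE_ORDER):
    """Return the feature values as a list in the fixed column order."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[name] for name in order]


# The dictionary order does not matter; the row follows FEATURE_ORDER.
row = to_row({
    "URL_Length": 0,
    "having_At_Symbol": -1,
    "having_IP_Address": 1,
    "Shortining_Service": -1,
})
```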
import re
import whois
import datetime
import requests
import ipaddress
from dns import resolver
from bs4 import BeautifulSoup
from urllib.parse import urlparse
class FeatureExtraction:
    def __init__(self, url):
        self.url = url
        self.parsedurl = urlparse(self.url)
        self.domain = self.parsedurl.netloc
        try:
            self.whois = whois.whois(self.domain)
        except:
            self.whois = None
        try:
            self.request = requests.get(self.url, timeout=5, headers={
                "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) "
                              "AppleWebKit/537.36 (KHTML, like Gecko) "
                              "Chrome/81.0.4044.141 Safari/537.36"})
            self.soup = BeautifulSoup(self.request.content, 'html.parser')
        except:
            self.request = None
            self.soup = None
        self.shortening_services = (
            r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|"
            r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|"
            r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|"
            r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|"
            r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|"
            r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|"
            r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|"
            r"tr\.im|link\.zip\.net"
        )
    def getFeaturesDict(self):
        return {
            "having_IP_Address": self.having_IP_Address(),
            "URL_Length": self.URL_Length(),
            "Shortining_Service": self.Shortining_Service(),
            "having_At_Symbol": self.having_At_Symbol(),
            "double_slash_redirecting": self.double_slash_redirecting(),
            "Prefix_Suffix": self.Prefix_Suffix(),
            "having_Sub_Domain": self.having_Sub_Domain(),
            "URL_Depth": self.URL_Depth(),
            "Domain_registeration_length": self.Domain_registeration_length(),
            "Favicon": self.Favicon(),
            "port": self.port(),
            "HTTPS_token": self.HTTPS_token(),
            "Request_URL": self.Request_URL(),
            "URL_of_Anchor": self.URL_of_Anchor(),
            "Links_in_tags": self.Links_in_tags(),
            "SFH": self.SFH(),
            "Submitting_to_email": self.Submitting_to_email(),
            "Abnormal_URL": self.Abnormal_URL(),
            "Redirect": self.Redirect(),
            "on_mouseover": self.on_mouseover(),
            "RightClick": self.RightClick(),
            "popUpWidnow": self.popUpWidnow(),
            "Iframe": self.Iframe(),
            "age_of_domain": self.age_of_domain(),
            "DNSRecord": self.DNSRecord(),
            "web_traffic": self.web_traffic()
        }
    def having_IP_Address(self):
        try:
            ipaddress.ip_address(self.domain)
            return 1
        except:
            return -1
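One caveat with the check above is that `netloc` keeps any port or user information, so a host such as `192.168.0.1:8080` is not parsed as an IP address. The sketch below shows a possible variant based on `urlparse(...).hostname`, which drops the port and the IPv6 brackets. The function name `host_is_ip` and the sample URLs are assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(url):
    """Return True when the host part of `url` is a literal IP address."""
    # hostname strips the port, user info and the brackets around IPv6 hosts.
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# netloc still carries the port, which ipaddress.ip_address() rejects.
netloc_with_port = urlparse("http://192.168.0.1:8080/login").netloc
```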
    def URL_Length(self):
        if len(self.url) < 54:
            return -1
        elif len(self.url) >= 54 and len(self.url) <= 75:
            return 0
        else:
            return 1
"""#### ** Using URL Shortening Services “TinyURL” **
URL shortening is a method on the “World Wide Web” in which a URL may be
made considerably smaller in length and still lead to the required webpage. This is
accomplished by means of an “HTTP Redirect” on a domain name that is short,
which links to the webpage that has a long URL.
If the URL is using Shortening Services, the value assigned to this feature is 1
(phishing) or else -1 (legitimate).
"""
    def Shortining_Service(self):
        if re.search(self.shortening_services, self.url):
            return 1
        else:
            return -1
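Because the pattern above is searched anywhere in the URL, a short alternative such as `t\.co` can also match inside an unrelated host name. The sketch below illustrates this with a small subset of the list and shows one possible alternative that compares the host name against a set of known shorteners. `SHORTENER_HOSTS`, `SUBSTRING_PATTERN` and `uses_shortener` are hypothetical names, and the subset is for illustration only.

```python
import re
from urllib.parse import urlparse

# Small illustrative subset of the shortening services listed above.
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly"}
SUBSTRING_PATTERN = re.compile(r"bit\.ly|goo\.gl|t\.co|tinyurl|ow\.ly")


def uses_shortener(url):
    """Return True only when the host itself is a known shortening service."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS


# "microsoft.com" contains the substring "t.co", so the substring search fires.
false_positive = SUBSTRING_PATTERN.search("https://www.microsoft.com/") is not None
```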
    def having_At_Symbol(self):
        if '@' in self.url:
            return 1
        else:
            return -1
    def double_slash_redirecting(self):
        if re.search(r'https?://[^\s]*//', self.url):
            return 1
        else:
            return -1
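The regular expression above looks for a second "//" after the one that belongs to the scheme. As a rough illustration, the sketch below expresses the same idea with plain string operations and compares it with the regular expression on two sample URLs. The function names and the URLs are assumptions, not part of the project code.

```python
import re


def has_extra_double_slash(url):
    """Return True when "//" appears again after the scheme separator."""
    scheme_end = url.find("://")
    # Skip the "//" that belongs to the scheme, if there is one.
    rest = url[scheme_end + 3:] if scheme_end != -1 else url
    return "//" in rest


def regex_check(url):
    """The same check written with the regular expression used above."""
    return re.search(r'https?://[^\s]*//', url) is not None


redirecting = "http://example.com//http://evil.example"
plain = "https://example.com/a/b"
```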
    def Prefix_Suffix(self):
        if '-' in self.domain:
            return 1
        else:
            return -1
"""#### ** SubDomains **
If the domain contains more than 3 dots, the value assigned to this feature is 1
(phishing); if it contains exactly 3 dots, 0 (suspicious); otherwise -1 (legitimate).
"""
    def having_Sub_Domain(self):
        count = self.domain.count('.')
        if count <= 2:
            return -1
        elif count > 2 and count <= 3:
            return 0
        else:
            return 1
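To make the thresholds concrete, the sketch below applies the same dot-counting rule to a few sample hosts. `subdomain_score` is a hypothetical stand-alone helper and the URLs are made up. Note that a country-code suffix such as `.co.uk` adds one dot, so under this rule an ordinary `www` host on such a suffix already scores as suspicious.

```python
from urllib.parse import urlparse


def subdomain_score(url):
    """Score a URL by the number of dots in its host name."""
    host = urlparse(url).hostname or ""
    dots = host.count(".")
    if dots <= 2:
        return -1  # legitimate
    if dots == 3:
        return 0   # suspicious
    return 1       # phishing


scores = {
    "https://www.example.com/": subdomain_score("https://www.example.com/"),
    "https://www.example.co.uk/": subdomain_score("https://www.example.co.uk/"),
    "http://login.secure.bank.example.com/": subdomain_score(
        "http://login.secure.bank.example.com/"),
}
```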
    def URL_Depth(self):
        depth = 0
        subdirs = self.parsedurl.path.split('/')
        for subdir in subdirs:
            if subdir:
                depth += 1
        return depth
    def Domain_registeration_length(self):
        if self.whois is None:
            return 1
        try:
            if type(self.whois['expiration_date']) is list:
                expiration_date = self.whois['expiration_date'][0]
            else:
                expiration_date = self.whois['expiration_date']
            registration_length = abs(
                (expiration_date - datetime.datetime.now()).days)
            if registration_length / 30 >= 6:
                return -1
            else:
                return 1
        except:
            return 1
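The date arithmetic above can be tried in isolation. The sketch below is a hypothetical stand-alone variant that takes the expiration date and the current time as arguments so that it can be checked with fixed dates. Unlike the method above, it does not take the absolute value of the difference, so a domain that expired long ago is not scored as legitimate; that is a design choice of this sketch, not of the project code.

```python
import datetime


def registration_score(expiration_date, now=None, min_months=6):
    """Return -1 when at least `min_months` of registration remain, else 1."""
    now = now or datetime.datetime.now()
    # WHOIS results may hold a list of dates; use the first one.
    if isinstance(expiration_date, list):
        expiration_date = expiration_date[0]
    remaining_days = (expiration_date - now).days
    # Same 30-day month approximation as in the method above.
    return -1 if remaining_days / 30 >= min_months else 1


fixed_now = datetime.datetime(2024, 1, 1)
long_lived = registration_score(datetime.datetime(2025, 1, 1), now=fixed_now)
short_lived = registration_score(datetime.datetime(2024, 3, 1), now=fixed_now)
expired = registration_score(datetime.datetime(2022, 1, 1), now=fixed_now)
```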
"""#### ** Favicon **
Checks for the presence of a favicon in the website. The presence of a favicon in
the website can be used as a feature to detect phishing websites.
If the website has a favicon, the value assigned to this feature is -1 (legitimate) or
else 1 (phishing).
"""
    def Favicon(self):
        try:
            if re.findall(r'favicon', self.soup.text) or \
                    self.soup.find('link', rel='shortcut icon') or \
                    self.soup.find('link', rel='icon'):
                return -1
            else:
                return 1
        except:
            return 1
    def port(self):
        if self.parsedurl.port:
            return 1
        else:
            return -1

    def HTTPS_token(self):
        if 'https' in self.domain:
            return 1
        else:
            return -1
"""### ** Request_URL **
The fine line that distinguishes phishing websites from legitimate ones is how
many times a website has been redirected. In our dataset, we find that legitimate
websites have been redirected one time max. On the other hand, phishing websites
containing this feature have been redirected at least 4 times.
"""
"""#### ** URL_of_Anchor **
The absence of “<a>” HTML tags with an “href” attribute in the page is treated as an
indicator of phishing websites. This feature counts the “<a>” tags in the page that
carry an “href” attribute.
If the page has no such tag, the value assigned to this feature is 1 (phishing) or else
-1 (legitimate).
"""
    def URL_of_Anchor(self):
        try:
            count = 0
            for i in self.soup.find_all('a'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except:
            return 1
"""#### ** Links_in_tags **
The absence of “<link>” HTML tags with an “href” attribute in the page is treated as
an indicator of phishing websites. This feature counts the “<link>” tags in the page
that carry an “href” attribute.
If the page has no such tag, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate).
"""
    def Links_in_tags(self):
        try:
            count = 0
            for i in self.soup.find_all('link'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except:
            return 1
"""#### ** SFH **
The presence of “<form>” HTML tag in the URL is a strong indicator of phishing
websites. This feature checks for the presence of “<form>” tag in the URL.
If the URL has “<form>” tag, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate).
"""
                return 1
            else:
                return -1
        except:
            return 0
"""#### ** Submitting_to_email **
The presence of “mailto:” in the page is treated as an indicator of phishing websites.
This feature checks whether any link or form in the page points to a “mailto:” address.
If the page has such a “mailto:” target, the value assigned to this feature is 1
(phishing) or else -1 (legitimate).
"""
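    def Submitting_to_email(self):
        try:
            # soup.find('mailto:') would look for a tag named "mailto:",
            # so match href/action attributes that start with "mailto:" instead
            mailto = re.compile(r'^\s*mailto:', re.I)
            if self.soup.find(href=mailto) or self.soup.find(action=mailto):
                return 1
            else:
                return -1
        except:
            return 0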
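As an illustration of the check described above, the sketch below detects “mailto:” targets with the standard library's `html.parser` only. `MailtoFinder` and `submits_to_email` are hypothetical names, and the HTML snippets are made-up samples; the project itself works on the BeautifulSoup object.

```python
from html.parser import HTMLParser


class MailtoFinder(HTMLParser):
    """Record whether any href or action attribute starts with "mailto:"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in ("href", "action") and value:
                if value.strip().lower().startswith("mailto:"):
                    self.found = True


def submits_to_email(html):
    """Return 1 when the page has a mailto: link or form target, else -1."""
    finder = MailtoFinder()
    finder.feed(html)
    return 1 if finder.found else -1


form_page = '<form action="mailto:someone@example.com"><input name="pw"></form>'
plain_page = '<a href="https://example.com">home</a><p>mailto: in text only</p>'
```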
"""#### ** Abnormal_URL **
The presence of script-related keywords (such as “script”, “javascript” or “onclick”)
in the URL is treated as an indicator of phishing websites. This feature checks the
URL for such keywords.
If the URL contains any of them, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate).
"""
    def Abnormal_URL(self):
        try:
            if re.findall(r'script|javascript|alert|onmouseover|onload|onerror|onclick|onmouse',
                          self.url):
                return 1
            else:
                return -1
        except:
            return -1
"""#### ** Redirect **
The presence of a “<meta>” refresh tag in the page is treated as an indicator of
phishing websites. This feature checks the page for a “<meta>” tag whose “http-equiv”
attribute is “refresh”.
If the page has such a tag, the value assigned to this feature is 1 (phishing) or
else -1 (legitimate).
"""
    def Redirect(self):
        try:
            if self.soup.find('meta', attrs={'http-equiv': 'refresh'}):
                return 1
            else:
                return -1
        except:
            return -1
                return 1
            else:
                return -1
        except:
            return -1
    def popUpWidnow(self):
        try:
            if re.findall(r"alert\(|onMouseOver|window.open", self.soup.text):
                return 1
            else:
                return -1
        except:
            return -1
    def age_of_domain(self):
        if self.whois is None:
            return 1
        try:
            if type(self.whois['creation_date']) is list:
                creation_date = self.whois['creation_date'][0]
            else:
                creation_date = self.whois['creation_date']
    def DNSRecord(self):
        try:
            resolver.resolve(self.domain, 'A')
            return -1
        except:
            return 1
"""#### ** web_traffic **
... Web Information Company, 1996). By reviewing our dataset, we find that in worst
scenarios, legitimate websites ranked among the top 100,000. Furthermore, if the
domain has no traffic or is not recognized by the Alexa database, it is classified as
“Phishing”.
If the rank of the domain < 100000, the value of this feature is -1 (legitimate) else 1
(phishing).
"""
    def web_traffic(self):
        try:
            alexadata = BeautifulSoup(requests.get(
                "http://data.alexa.com/data?cli=10&dat=s&url=" + self.domain,
                timeout=10).content, 'lxml')
            rank = int(alexadata.find('reach')['rank'])
            if rank < 100000:
                return -1
            else:
                return 1
        except:
            return 1
h2i = Html2Image()
h2i.output_path = screenshot_dir
        features_obj = FeatureExtraction(target_url)
        x = pd.DataFrame.from_dict(features_obj.getFeaturesDict(), orient='index').T
        pred_prob = model.predict_proba(x)[0]
        safe_prob = pred_prob[0]
        unsafe_prob = pred_prob[1]
        if pred == 1:
            update_stats('phished')
        return dict(
            status=True,
            domain=target.netloc,
            target=target_url,
            safe_percentage=safe_prob*100,
            unsafe_percentage=unsafe_prob*100
        )
    except Exception as e:
        return dict(status=False, message=str(e))
def get_stats(key=None):
    stats = {}
    if os.path.exists(stats_filename):
        with open(stats_filename, "r") as file:
            for line in file:
                (k, v) = line.split(":")
                stats[k] = int(v)
        return stats
    return False
def update_stats(key):
    stats = get_stats()
    with open("stats.txt", "w+") as file:
        if stats is False:
            file.write('\n'.join([f"{x}:0" for x in stats_params]))
        else:
            lines = []
            avail_params = list(stats_params)
            for k, v in stats.items():
                avail_params.remove(k)
                if k == key:
                    v += 1
                lines.append(f"{k}:{v}")
            if len(avail_params) > 0:
                for param in avail_params:
                    lines.append(f"{param}:0")
            file.write("\n".join(lines))
            file.flush()
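The two functions above keep the counters in a small `key:value` text file. The sketch below is one possible variant of that idea, under stated assumptions: the counter names in `STATS_PARAMS` are taken from the `/listen` route, and `read_stats` and `bump_stat` are hypothetical helpers. It writes to a temporary file and then replaces the stats file, so a reader that polls the file never sees it half-written, and it also counts the very first update.

```python
import os
import tempfile

# Assumed counter names, taken from the keys used by the /listen route.
STATS_PARAMS = ("visits", "checked", "phished")


def read_stats(path):
    """Read key:value lines from `path`; missing counters default to 0."""
    stats = {name: 0 for name in STATS_PARAMS}
    if os.path.exists(path):
        with open(path) as file:
            for line in file:
                line = line.strip()
                if not line:
                    continue
                key, _, value = line.partition(":")
                stats[key] = int(value)
    return stats


def bump_stat(path, key):
    """Increment one counter and rewrite the file via an atomic replace."""
    stats = read_stats(path)
    stats[key] = stats.get(key, 0) + 1
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as file:
        file.write("\n".join(f"{k}:{v}" for k, v in stats.items()))
    # os.replace swaps the finished file in with a single rename.
    os.replace(tmp_path, path)
    return stats
```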
Main.py
from flask import Flask
from url_detector import *
from content_detector import *

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)
app = Flask(__name__)
app.secret_key = os.urandom(12).hex()
default_screenshot_width = 1920
default_screenshot_height = 1080


@app.route('/')
def home():
    update_stats('visits')
    return render_template("index.html")
@app.route('/check', methods=['GET', 'POST'])
def check():
    update_stats('visits')
    if request.method == "POST":
        target_url = request.json['target']
        result = get_phishing_result(target_url=target_url)
        return jsonify(result)
    target_url = request.args.get("target")
    return render_template('check.html', target=target_url)
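@app.route("/listen")
def listen():
    def respond_to_client():
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'],
                 "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)
    # Stream the generator to the browser as server-sent events
    # (app.response_class is Flask's Response class)
    return app.response_class(respond_to_client(), mimetype='text/event-stream')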
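The `/listen` route sends the counters as server-sent events: each message is a block of `field: value` lines terminated by a blank line. The helper below is a small sketch of that framing. `sse_message` is a hypothetical function and the sample payload is made up; the field names `id`, `event` and `data` are the ones used in the route.

```python
import json


def sse_message(data, event=None, message_id=None):
    """Format `data` as one server-sent event with a JSON payload."""
    lines = []
    if message_id is not None:
        lines.append(f"id: {message_id}")
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


message = sse_message({"visits": 3}, event="stats", message_id=1)
```

On the browser side, an `EventSource` listener registered for the `stats` event receives the JSON string in its `data` field.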
@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()
        width = default_screenshot_width
        height = default_screenshot_height
        ss_file_name = secure_filename(
            f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)
        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)
        return capture_screenshot(target_url=target_url,
                                  filename=ss_file_name, size=(width, height))
    abort(404)
if __name__ == '__main__':
    app.run(debug=True)
A2 – SCREENSHOTS
Figure A2.2 URL Detector page of the website
This image also shows the number of website visits, the total number of websites
checked, and the total number of phishing websites detected out of the total
websites checked.
Figure A2.5 Invalid phishing URL has been pasted