PHISHARMOUR: A RESILIENT PHISHING WEBSITE

DETECTION WITH ENSEMBLE MODEL

A PROJECT REPORT

Submitted by

NITHISH. B (312820104057)
PRADEEP JOSHWA (312820104060)
PRAKASH RAJA. C (312820104061)

Under the guidance of

MS.AISHWARYA (ASST. PROFESSOR, CSE)

in partial fulfilment for the award of the degree of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

AGNI COLLEGE OF TECHNOLOGY

MAY 2024
ABSTRACT

Phishing attacks represent a formidable challenge in the realm of cybersecurity, with
attackers continually evolving their tactics to outsmart existing defense mechanisms.
These deceptive tactics involve the creation of replica websites that are
indistinguishable from legitimate ones, duping unsuspecting users into sharing
sensitive information such as usernames, passwords, and financial details. The
repercussions of such attacks extend far beyond their initial targets, with pilfered
credentials serving as potential entry points for unauthorised access across a
spectrum of popular online platforms. Despite the availability of countermeasures
like anti-phishing tools and browser extensions, the tenacity of these attacks
underscores the inadequacy of prevailing approaches. As the digital landscape
becomes more complex, there arises an urgent need to fortify defences against
phishing threats and to bolster the resilience of cybersecurity measures.

In response to the escalating sophistication of phishing attacks, this project is
dedicated to elevating the efficiency and computational performance of existing
phishing detection models. The overarching goal is a holistic optimization that not
only strengthens the ability to identify phishing attempts but also streamlines the
overall operational framework. This optimization journey involves a multifaceted
approach, encompassing the refinement of algorithms to enhance accuracy, the
improvement of data processing methods for swifter analysis, and the fortification of
models against emerging evasion techniques employed by cybercriminals. A pivotal
aspect of this endeavour is model pruning, a strategic reduction of hardware
workload without compromising the system's resilience. By achieving this delicate
balance, the project aims to ensure that the phishing detection system remains
adaptive and efficient in real-time, responding effectively to the dynamic landscape
of cyber threats.

These efforts are not merely technical enhancements; they are crucial for mitigating
the risk of data breaches and identity theft that loom over unsuspecting users. As
phishing attacks become more intricate and insidious, a proactive and adaptive
response is imperative to safeguard the digital ecosystem. Beyond the immediate
goal of threat mitigation, these optimizations contribute significantly to the broader
objective of fostering user privacy, building trust, and preserving data integrity across diverse
online platforms. In essence, this project represents a critical stride towards
fortifying the digital realm against the ever-evolving and persistent menace of
phishing attacks.
TABLE OF CONTENTS

CHAPTER NO. TITLE

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 INTRODUCTION
1.1 OVERVIEW
1.1.1 DATA SCIENCE
1.1.2 MACHINE LEARNING
1.2 OBJECTIVE
1.3 DESCRIPTION
1.4 STRUCTURE OF THE PROJECT REPORT

2 LITERATURE SURVEY

3 PROBLEM DEFINITION AND METHODOLOGIES
3.1 PROBLEM DEFINITION
3.2 EXISTING SYSTEM
3.3 PROPOSED SYSTEM
3.4 METHODOLOGIES
3.4.1 WEB SCRAPING AND DATA COLLECTION
3.4.2 FEATURE EXTRACTION
3.4.3 FISHER’S SCORE FOR FEATURE SELECTION
3.4.4 MULTILAYER STACKED ENSEMBLE MODEL
3.4.5 XGBOOST MODEL
3.4.6 METRIC SELECTION

4 DESIGN PROCESS
4.1 DESIGN OVERVIEW
4.2 DATA FLOW DIAGRAM
4.3 ARCHITECTURE DIAGRAM
4.4 PROJECT REQUIREMENTS
4.4.1 SOFTWARE REQUIREMENTS
4.4.2 HARDWARE REQUIREMENTS

5 IMPLEMENTATIONS
5.1 ANALYSIS ON THE DATA
5.2 DATA VISUALIZATION
5.3 MODEL BUILDING AND TRAINING

6 EXPERIMENTATION RESULTS
6.1 OBSERVATIONS
6.2 INFERENCES

7 FUTURE WORK AND ENHANCEMENTS

REFERENCES

APPENDIX
A1 – SOURCE CODE
A2 – SCREENSHOTS

LIST OF FIGURES

FIGURE NO. FIGURE NAME

FIGURE 4.1 DATA FLOW DIAGRAM
FIGURE 4.2 FRONT-END BASED ARCHITECTURE DIAGRAM
FIGURE 4.3 BACK-END BASED ARCHITECTURE DIAGRAM
FIGURE 5.1 DATA
FIGURE 5.2 SHAPE OF THE DATA
FIGURE 5.3 FEATURES OF THE DATA
FIGURE 5.4 INFORMATION ABOUT THE DATA
FIGURE 5.5 UNIQUE VALUES OF THE DATA
FIGURE 5.6 DESCRIPTION OF THE DATA
FIGURE 5.7 CORRELATION MAP OF THE DATA
FIGURE 5.8 PIE CHART FOR CLASSES OF THE DATA
FIGURE 5.9 TESTING AND TRAINING DATA
FIGURE 5.10 FISHER’S SCORE FOR THE FEATURES IN THE DATA
FIGURE 5.11 FINAL ACCURACY OF THE MODEL
FIGURE 5.12 CONFUSION MATRIX FOR THE MODEL
FIGURE 5.13 CLASSIFICATION REPORT OF THE MODEL
FIGURE 6.1 THE PERFORMANCE OF XGBOOST AND MLSELM MODEL
FIGURE A2.1 HOMEPAGE OF THE WEBSITE
FIGURE A2.2 URL DETECTOR PAGE OF THE WEBSITE
FIGURE A2.3 VALID PHISHING HAS BEEN PASTED
FIGURE A2.4 VALID PHISHING HAS BEEN DETECTED
FIGURE A2.5 INVALID PHISHING HAS BEEN PASTED
FIGURE A2.6 INVALID PHISHING HAS BEEN DETECTED

LIST OF ABBREVIATIONS

ML Machine Learning

AI Artificial Intelligence

WEKA Waikato Environment for Knowledge Analysis

URL Uniform Resource Locator

MLSELM Multilayer Stacked Ensemble Learning Model

LSTM Long Short-Term Memory

RF Random Forest

CNN Convolutional Neural Network

RNN Recurrent Neural Network

SVM Support Vector Machine

LR Logistic Regression

DT Decision Tree

NB Naive Bayes

SVC Support Vector Classifier

IDS Intrusion Detection System

MITM Man-in-the-Middle

DOS Denial of Service

KNN K-Nearest Neighbors

IOT Internet of Things

HTTPS Hypertext Transfer Protocol Secure

IP Internet Protocol

API Application Programming Interface

MLP Multilayer Perceptron

XGB Extreme Gradient Boosting

RAM Random Access Memory

GPU Graphics Processing Unit

NTP Number of True Positives

NTN Number of True Negatives

NFP Number of False Positives

NFN Number of False Negatives

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW

Phishing attacks represent a fraudulent attempt to obtain confidential information by
posing as a legitimate entity. These attacks often involve fake emails, websites or
messages that trick users into revealing personal information such as passwords,
bank details, and credentials. Despite preventive measures, such as anti-phishing
tools, these attacks persist because attackers keep changing their methods. To
counter these persistent threats, the development of machine learning models for
phishing detection is gaining attention. ML models use algorithms and data analysis
to distinguish between legitimate and phishing websites. Features such as URL
structure, content, and user behavior patterns are analyzed to create predictive
models that can identify potential threats.
However, building ML models for phishing detection is difficult. Obtaining high-quality
data for training and testing models, ensuring model accuracy, adapting to new
phishing techniques, and balancing detection accuracy against computational efficiency
are critical challenges.

Data science and machine learning are closely intertwined disciplines, often used in
conjunction to extract valuable insights from data, make predictions, and automate
decision-making processes. Let's delve into how they are integrated, particularly in
the context of classification tasks.

1.1.1 DATA SCIENCE

Data science is a broader field that encompasses a range of techniques and
methodologies to handle and analyse data. It involves processes such as data
collection, cleaning, exploration, and visualisation. The goal of data science is to
derive actionable insights and knowledge from data to support decision-making.

1.1.2 MACHINE LEARNING

Machine learning is a subset of artificial intelligence that focuses on the development
of algorithms and models that enable computers to learn patterns from data and
make predictions or decisions without explicit programming. It involves training
models on historical data to generalise and make accurate predictions on new,
unseen data.

Data Preparation: Data scientists play a crucial role in preparing the data for
classification tasks. They handle tasks such as cleaning noisy data, handling missing
values, and transforming data into a suitable format for machine learning algorithms.

Feature Engineering: Identifying and selecting relevant features (variables) from the
dataset is a critical step in classification. Data scientists use domain knowledge to
determine which features are most informative for the task at hand.

Model Selection: Machine learning practitioners, often working in collaboration with
data scientists, choose appropriate classification algorithms based on the nature of
the data and the problem. Common algorithms include logistic regression, decision
trees, support vector machines, and neural networks.

Training the Model: Machine learning models are trained on labelled datasets
during this phase. The model learns patterns and relationships between features and
labels, adjusting its parameters to make accurate predictions.

Evaluation and Validation: Data scientists are responsible for evaluating the
performance of the trained model using validation datasets. They use metrics such
as accuracy, precision, recall, and F1-score to assess how well the model
generalises to new, unseen data.

Iterative Process: The integration of data science and machine learning in
classification is often an iterative process. Data scientists and machine learning
practitioners collaborate to refine the model, adjust features, and improve overall
performance.

The seamless integration of data science and machine learning in classification tasks
allows organisations to automate decision-making processes, classify and categorise
data efficiently, and derive insights that contribute to informed decision-making. This
collaborative approach leverages the strengths of both fields to create robust and
accurate classification models.

1.2 OBJECTIVE

The objective of this project is to develop and compare two machine learning models
for the task of detecting phishing websites. The primary focus is on enhancing the
accuracy and efficiency of phishing detection methods by using feature selection,
and also on evaluating the models based on their accuracy in order to determine
which one performs better for the given task.

This objective aims to create a sophisticated solution capable of differentiating
between legitimate and deceptive websites by utilising a stacked ensemble machine
learning model, to effectively analyse patterns, URLs, and content structures.

Additionally, it also includes the potential integration of this ML-based system into
web browsers or as extensions, ensuring real-time warnings and protection for users
while browsing, ultimately fostering a safer and more secure online environment.

1.3 DESCRIPTION

This project employs machine learning algorithms to identify critical factors
influencing the detection of phishing websites. Two models, a Multilayer Stacked
Ensemble Model and an XGBoost Model, use techniques such as clustering and
classification to discern patterns in data.
Addressing the rising threat of phishing websites, the project aims to develop and
compare two robust machine learning models. The models are evaluated for
efficiency in detecting deceptive websites using Fisher's score for feature selection.
Additionally, a Chrome extension will integrate the superior model, offering real-time
warnings and enhanced user protection during online activities.

1.4 STRUCTURE OF THE PROJECT REPORT

Chapter 1 lays the groundwork for the entire project report. It provides an overview of
the project, including its purpose, scope, and significance. It also describes the
project in more detail, outlining the objectives and expected outcomes. Additionally, it
explains the structure of the report itself, giving the reader a roadmap to navigate the
information presented.

Chapter 2 provides a comprehensive overview of existing research and knowledge
related to the project's topic. It's essentially a review of relevant scholarly works,
books, articles, and other sources that inform the project's understanding of the
problem and potential solutions.

Chapter 3 tackles the heart of the project by defining the problem and outlining the
chosen approach to solve it. This chapter dives into the existing system, analyses its
limitations, and presents the proposed solution with its underlying algorithm. In
essence, it's the roadmap for tackling the challenge at hand.

Chapter 4 lays out the blueprint for a website with a Chrome extension, from its
purpose and audience to its technical structure and development process. It starts
with an overview, then specifies requirements, details the architecture, and breaks
down the design step-by-step. Each module gets its own dedicated explanation,
while the conclusion wraps everything up and suggests potential future directions.
Essentially, this chapter serves as a comprehensive roadmap for bringing the
website and its Chrome extension to life.

Chapter 5 chronicles the critical steps taken to construct and activate a powerful
phishing URL detection website. It showcases the development and deployment of
its core security features. It meticulously details the construction of these vital
safeguards, providing a comprehensive blueprint for those seeking to establish their
own phishing URL detection stronghold.

Chapter 6 discusses the significant lessons learned, and the necessary future
enhancements that can be used to improve the overall feasibility of the proposed
work.

CHAPTER 2

LITERATURE SURVEY

Lakshmana Rao Kalabarige et al., 2022 [1] propose a highly effective Multilayer
Stacked Ensemble Learning Model for phishing detection, achieving 96.79% to
98.90% accuracy. Outperforming baselines, the model underscores its efficacy with
improved metrics. The paper stresses the urgency of countering phishing, outlines
the model's architecture and results, and suggests future research on feature
selection and model optimization. Overall, the study introduces a potent detection
model, validates its effectiveness, and outlines avenues for further research.

Al-Sariera et al., 2019 [2] presented advanced AI meta-learner models like
AdaBoost-Extra Tree for phishing detection, achieving over 97% accuracy with
minimal false positives (below 0.028). The study critically reviewed existing methods,
emphasising the need for improved techniques. Thorough evaluation using 10-fold
cross-validation and WEKA software demonstrated the models' superiority in
accuracy and predictive capabilities over existing methods. The paper highlighted
the importance of interpretable AI models and suggested exploring alternative
decision tree algorithms and hybridization methods for future advancements.

Ayman El Aassal et al., 2020 [3] conducted a thorough benchmarking of phishing
detection research, emphasising the significance of diverse datasets and real-time
detection. They introduced PhishBench, a systematic framework for evaluating and
comparing detection methods, addressing challenges like imbalanced scenarios.
The study covers benchmarking frameworks, feature importance, and the
architecture of PhishBench, advocating for comprehensive datasets in phishing
detection research.

Rasha Zeini et al., 2023 [4] review phishing detection methods, emphasising model
explainability, feature engineering, and domain knowledge. They identify gaps,
including URL shortening challenges, and stress the importance of reproducibility,
diverse datasets, and informed feature selection. The document offers insights into
evolving phishing tactics, highlighting the need for continuous research and user
education in effective countermeasures.

Al-Sarem et al., 2021 [5] presented an optimised stacking ensemble method for
phishing detection, employing Genetic Algorithm to determine optimal parameters.
The ensemble comprised algorithms like Random Forests, AdaBoost, XGBoost,
Bagging, GradientBoost, and LightGBM, applied to UCI Phishing, Mendeley, and
Mendeley-small variant datasets. The model demonstrated remarkable accuracy of
97.16%, 98.58%, and 97.35% on the respective datasets, showcasing its
effectiveness across diverse phishing instances and features.

Yi Wei et al., 2022 [6] compare machine learning and deep
learning methods for phishing website classification. They assess traditional
algorithms, ensemble methods, and deep learning models like Random Forest,
AdaBoost, LSTM, CNN, and RNN. Results emphasise ensemble methods'
effectiveness, particularly Random Forest, in achieving high accuracy and
computational efficiency, especially with reduced feature sets.

Ahmet Ozaday et al., 2022 [7] used six machine learning algorithms to classify URLs
based on eleven features, with Random Forest yielding the highest accuracy of
98.90%. Comparing various methods, they concluded Random Forest provided
consistent and superior performance. The study stressed the importance of updated
datasets, global collaboration, and user awareness in combating phishing.

A. Karim et al., 2023 [8] developed a phishing detection system employing various
machine learning models and a hybrid approach (LR+SVC+DT). The hybrid model
demonstrated high efficiency, utilising metrics like accuracy, precision, recall,
specificity, and F1-score. The study underscores the effectiveness of combining
list-based and machine-learning-based systems for more efficient phishing URL
detection.

M. Aljabri et al., 2022 [9] offer a thorough review of ML algorithms for detecting
malicious URLs, highlighting SVM, RF, DT, NB, and LR with accuracy surpassing
98.42%. The document underscores the effectiveness of ensemble techniques in
achieving over 90% accuracy and discusses challenges like sample sizes and
network traffic considerations. Providing insights into datasets, features, and model
accuracy, the study contributes to understanding and addressing unresolved issues
in malicious URL detection.

Priscilla Kyei Danso et al., 2022 [10] tackle IoT security challenges with an
Intelligent Ensemble-based IDS at the gateway, mitigating threats like MITM and
DoS. The proposed solution employs Naïve Bayes, SVM, and k-NN as base
learners, demonstrating efficacy through ensemble models on various datasets. The
study emphasises the importance of ensemble learning in IoT security and suggests
future directions for anomaly-based IDS improvements.

Pankaj Saraswat et al., 2022 [11] address email security challenges, focusing on
phishing detection with machine learning. Using SVM and Random Forest, the study
achieves a maximum accuracy of 96.87%, emphasising the need for effective
detection methods against evolving phishing techniques. The proposed system
extracts link, tag, and word-based features, underscoring the importance of dataset
expansion for real-world applicability.

F. Castaño et al., 2023 [12] introduce PhiKitA-500, a dataset linking phishing
websites to phishing kits, facilitating algorithm evaluation. The methodology involves
stages like source definition, phishing kit collection, website extraction, and
post-processing. Results indicate successful grouping of phishing kits, demonstrating
the utility of kit information in classifying phishing attacks, despite challenges in
multiclass classification.

CHAPTER 3

PROBLEM DEFINITION AND METHODOLOGIES

In this chapter, the project delves into the core by identifying the problem and
articulating the selected strategy for resolution. The chapter critically examines the
current system, putting forth its constraints, and introduces the solution, complete
with its underlying algorithm. Essentially, this section serves as a comprehensive
guide, outlining the path forward for addressing the project's central challenge.

3.1 PROBLEM DEFINITION

The escalating threat of phishing websites poses a significant challenge to online
security. Existing detection methods require enhancement to discern deceptive sites
effectively. This project addresses the need for robust machine learning models
capable of identifying critical factors influencing phishing website detection. The lack
of a comprehensive solution, coupled with the urgency to protect users from evolving
phishing techniques, necessitates the development and comparison of advanced
models. The challenge lies in creating models that not only exhibit high efficiency in
detecting deceptive websites but also integrate seamlessly into user workflows,
providing real-time warnings and enhancing online security.

3.2 EXISTING SYSTEM

The existing model for phishing detection is a Layer-wise Stacked Ensemble
Learning architecture, comprising multiple layers of estimators culminating in a
meta-learner. The workflow involves initialising the model, creating multiple layers
with diverse estimators, and adding a meta-learner as the final layer for
comprehensive decision-making. The Stacked Ensemble Learning process involves
running estimators in parallel within layers and sequentially between layers,
employing various models like Random Forest and Logistic Regression. The phases
of the Multilayer Stacked Ensemble Learning Model (MLSELM) include the input
phase with the Phishing dataset, data balancing phase, and the implementation
phase for executing the model effectively.

3.3 PROPOSED SYSTEM

The proposed system incorporates a dedicated website aimed at detecting phishing
URLs, complemented by the introduction of a user-friendly Chrome extension
available for download to enhance the detection of potential phishing URLs. To
improve feature selection, the system integrates Fisher's score, a methodology not
implemented in the existing system. Through this enhancement, the comparison
between the Multilayer Stacked Ensemble Model and the XGBoost Model is done to
evaluate their efficacy in the context of phishing URL detection, addressing a critical
aspect of online security that was not explicitly covered in the existing model. The
Chrome extension's functionality includes real-time pop-up notifications when
hovering over URLs, signaling potential phishing attempts. By combining the
capabilities of the integrated website and Chrome extension, the system actively
protects users during browsing activities.

3.4 METHODOLOGIES

3.4.1 Web Scraping and Data Collection:

Web scraping is a technique used to extract data from websites. In this context, the
system employs web scraping to collect a large dataset of URLs for both phishing
websites and legitimate websites. This process involves crawling known phishing
websites, which are sites designed to steal sensitive information such as login
credentials or financial data from users, as well as legitimate websites to ensure a
comprehensive dataset. By collecting URLs from both categories, the system can
train machine learning models to differentiate between phishing and legitimate URLs
effectively.
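
The collection step above can be illustrated with a minimal sketch. It assumes the page HTML has already been downloaded, and it only shows how links could be pulled out of a page and stored with a class label. The names `LinkExtractor` and `collect_labelled_urls`, the label convention (0 for legitimate, 1 for phishing), and the sample HTML are illustrative assumptions, not the crawler used in this project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from the anchor tags of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if urlparse(absolute).scheme in ("http", "https"):
            self.links.append(absolute)


def collect_labelled_urls(pages, label):
    """pages: iterable of (base_url, html_text); returns de-duplicated (url, label) rows."""
    seen, rows = set(), []
    for base_url, html_text in pages:
        parser = LinkExtractor(base_url)
        parser.feed(html_text)
        for url in parser.links:
            if url not in seen:
                seen.add(url)
                rows.append((url, label))
    return rows


# Hypothetical page content standing in for a downloaded legitimate site.
sample_html = (
    '<a href="/login">Login</a>'
    '<a href="https://example.org/a">A</a>'
    '<a href="mailto:x@example.org">Mail</a>'
    '<a href="/login">Again</a>'
)
rows = collect_labelled_urls([("https://example.com/", sample_html)], label=0)
```

Running the same function over pages from known phishing sources with `label=1` would give the second half of the dataset.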

3.4.2 Feature Extraction:

Feature extraction involves identifying and extracting relevant information from the
collected URLs to create a comprehensive feature set for model training. Features
such as URL length, presence of HTTPS, use of IP addresses, presence of
suspicious keywords, and other relevant indicators are likely extracted. These
features provide valuable information for the machine learning models to learn and
make predictions effectively.
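
As a rough illustration of this step, the sketch below turns one URL into a small dictionary of numeric features. The keyword list and the feature names are assumptions chosen for the example. They are not the feature set of the dataset used in this project.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list; a real system would curate this from data.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def host_is_ip(host):
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Map a URL string to a dictionary of simple lexical features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "host_is_ip": int(host_is_ip(host)),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": int("@" in url),
        "suspicious_keyword_count": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
    }


example_url = "http://192.168.10.5/secure-login/update.php"
features = extract_features(example_url)
```

Each dictionary becomes one row of the training table, with the class label stored alongside it.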

3.4.3 Fisher's Score for Feature Selection:

Fisher's score is a statistical measure used for feature selection, helping identify the
most informative features for training the models. In this step, features with higher
Fisher's scores are considered more relevant and are selected for model training,
while less informative features are discarded. By focusing on the most informative
features, the system improves the performance of the machine learning models by
reducing noise and irrelevant information in the dataset.
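
A minimal sketch of this ranking is given below. It uses the common textbook definition of Fisher's score for a single feature, the class-size-weighted squared distance of each class mean from the overall mean divided by the class-size-weighted within-class variance. This definition is assumed here, since the exact variant is not specified above. The two toy feature columns and the function names are illustrative only.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher's score of one feature: between-class scatter / within-class scatter."""
    overall_mean = fmean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (fmean(group) - overall_mean) ** 2
        within += len(group) * pvariance(group)
    # A feature that is constant inside every class separates the classes perfectly.
    return between / within if within > 0 else float("inf")


def select_top_k(columns, labels, k):
    """Rank features by Fisher's score and keep the k highest."""
    scores = {name: fisher_score(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data: 1 = phishing, 0 = legitimate (values are made up for illustration).
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "url_length": [80, 95, 110, 20, 30, 25],
    "num_dots": [2, 3, 1, 2, 1, 3],
}
selected, scores = select_top_k(columns, labels, k=1)
```

In this toy data `url_length` separates the two classes well and receives a high score, while `num_dots` has identical class means and scores zero, so only `url_length` is kept.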

3.4.4 Multilayer Stacked Ensemble Model:

A multilayer stacked ensemble model involves combining multiple machine learning
models in a stacked architecture. Each layer of the ensemble may use different
algorithms, such as decision trees, logistic regression, or neural networks, to learn
from the data and make predictions. The predictions from each layer are then
combined, typically using a meta-learner or aggregation method, to produce the final
output. This approach improves the overall performance of the model by leveraging
the strengths of multiple algorithms and capturing complex relationships in the data.

3.4.5 XGBoost Model:

XGBoost is a popular and powerful gradient-boosting algorithm commonly used in
classification tasks. It works by iteratively training decision trees to correct the errors
made by previous trees, leading to a highly accurate predictive model. In this
methodology, the system implements an XGBoost model as an alternative approach
for comparison with the multilayer stacked ensemble model. This allows for
evaluating the performance of different algorithms and selecting the best-performing
model for the task.
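
The idea of each tree correcting the errors of the previous ones can be sketched in plain Python. The code below is a simplified gradient-boosting loop with one-split trees (stumps) and squared-error loss on a single feature. It is not XGBoost itself, which adds regularisation, second-order gradients, and many optimisations. All function names and the toy data are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (threshold, left_value, right_value) with least squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:  # all xs identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    total = model["base"]
    for stump in model["stumps"]:
        total += model["learning_rate"] * stump_predict(stump, x)
    return total


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: URL length as the only feature, 1 = phishing, 0 = legitimate.
xs = [12, 18, 25, 40, 55, 70, 90, 120]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=20)
```

Comparing `mse(one_round, xs, ys)` with `mse(many_rounds, xs, ys)` shows the training error shrinking as more corrective trees are added.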

3.4.6 Metric Selection:
Evaluation metrics are used to assess the performance of the machine learning
models. Common metrics used in binary classification tasks like phishing detection
include precision, recall, F1 score, and accuracy.

• Precision measures the proportion of true phishing URLs among the URLs
predicted as phishing.
• Recall measures the proportion of true phishing URLs correctly identified by
the model.
• F1 score is the harmonic mean of precision and recall.
• Accuracy measures the overall correctness of the model's
predictions.

By using relevant evaluation metrics, the system can effectively evaluate and
compare the performance of the multilayer stacked ensemble model and the
XGBoost model to select the best-performing approach.
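
The four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes labels are encoded as 1 for phishing and 0 for legitimate. The label lists are made up for illustration and are not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    false_pos = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    true_neg = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return true_pos, false_pos, false_neg, true_neg


def safe_div(numerator, denominator):
    """Avoid division by zero when a class is never predicted or never present."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative labels only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
report = evaluate(y_true, y_pred)
```

Here three of the five URLs flagged as phishing are truly phishing (precision 0.6), three of the four phishing URLs are caught (recall 0.75), and five of the eight predictions are correct (accuracy 0.625).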

CHAPTER 4

DESIGN PROCESS

This chapter serves as a detailed guide for constructing a website accompanied by a
Chrome extension, outlining its purpose, target audience, technical framework, and
developmental procedures. The chapter begins with a broad overview, followed by a
precise delineation of requirements and an in-depth exploration of the architectural
blueprint.

4.1 DESIGN OVERVIEW

The proposed system offers a comprehensive solution for phishing URL detection,
with a dedicated website, and a convenient Chrome extension to strengthen user
protection while browsing the web. The dedicated website is the core of the system,
with an easy-to-use interface where users can input URLs to be analyzed. The
backend integrates top-notch machine learning models such as the multilayer stacked
ensemble model and the XGBoost model. The inclusion of Fisher’s score in the
feature selection methodology improves the system’s ability to identify phishing
patterns by focusing on critical aspects that are not explicitly covered by the current
system.

This website is designed to integrate seamlessly with the Chrome extension,
providing a unified and synchronous user experience. The Chrome extension is
available for download and provides an extra layer of protection while online. One of
the key features of this extension is the real-time pop-up notifications that appear
when a user hovers over a URL to alert them of a potential phishing attempt. These
notifications act as a direct link between the extension and the dedicated website,
providing timely warnings and alerting users so they can make informed decisions while
navigating the web.

The system’s design focuses on the smooth integration of the dedicated website with
the Chrome extension to provide a holistic approach to the detection of phishing
URLs. Sophisticated communication channels ensure the secure exchange of data
between the website’s backend and Chrome extension, while preserving the user’s
privacy and the system’s reliability. Regular updates and syncing mechanisms
ensure that the machine learning model and detection algorithms are always
up-to-date and effective against ever-evolving phishing tactics.

The user experience is at the forefront, with an intuitive website interface and
unobtrusive Chrome extension, allowing users to conveniently access protection
features without interrupting their browsing. Real-time notifications from the
extension enable users to make smart decisions about the security of the URLs they
encounter. In conclusion, the proposed solution combines the best features of a
dedicated site with a Chrome extension to increase the effectiveness of phishing
URL detection, while prioritising easy-to-use interactions and real-time feedback to
actively protect users during their online engagements.

4.2 DATA FLOW DIAGRAM

FIGURE 4.1 Data Flow Diagram

The key components of the phishing detection system include the user, who interacts
with the system through a Chrome Extension integrated with the web browser. The
user begins by registering and logging into the extension or a corresponding service.
When the user interacts with a website, the extension engages in URL analysis,
either letting the user paste a URL manually or automatically scanning links
when the cursor hovers over them. The core of the system is the Phishing Detection
System, which scrutinises the provided URL by comparing it against a database of
known phishing URLs, patterns, and characteristics. Determining whether the
website is likely a phishing attempt or legitimate, the system issues phishing alerts to
the user, potentially blocking access or displaying warnings if suspicious activity is
detected. This comprehensive process aims to enhance online security by
proactively identifying and notifying users about potential phishing threats.
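
The decision flow described above can be sketched as a small function: a lookup against known phishing URLs first, then a score from the detection model, mapped to a block, warn, or allow outcome. The function name `check_url`, the two thresholds, and the toy scoring function are assumptions for illustration. The actual thresholds and model interface are not specified in this chapter.

```python
def check_url(url, known_phishing, score_fn, warn_at=0.5, block_at=0.9):
    """Return 'block', 'warn' or 'allow' for one URL.

    known_phishing: set of URLs already confirmed as phishing.
    score_fn: callable returning a phishing probability in [0, 1] (the model).
    """
    if url in known_phishing:
        return "block"  # exact match against the known-phishing database
    score = score_fn(url)
    if score >= block_at:
        return "block"
    if score >= warn_at:
        return "warn"
    return "allow"


def toy_score(url):
    """Stand-in for the trained model; NOT a real detector."""
    if "@" in url:
        return 0.95
    return 0.6 if url.startswith("http://") else 0.1


known = {"http://bad.example/login"}
decisions = [
    check_url("http://bad.example/login", known, toy_score),
    check_url("https://user@evil.example/", known, toy_score),
    check_url("http://plain.example/", known, toy_score),
    check_url("https://example.com/", known, toy_score),
]
```

Keeping the database lookup ahead of the model call means already-confirmed phishing URLs are rejected without spending any model inference time.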

4.3 ARCHITECTURE DIAGRAM

Figure 4.2 Front end-based architecture diagram

The web application is the overarching entity, encompassing two main divisions: the
frontend and the backend. The frontend is the user-facing component primarily
accessed through the Chrome Extension. Within the frontend, the Content Script
operates as a script embedded in the visited webpage, providing access to the
webpage's content such as text, images, and structure to gather essential data. On
the other hand, the backend serves as the processing powerhouse behind the
scenes, managing critical functions. The Background Script, integrated into the
Chrome Extension, acts as a coordinator facilitating communication between the
content script and the backend by passing data seamlessly. Additionally, an essential
element of the backend is the API, functioning as an interface that enables the
detection model within the backend to receive data and convey evaluations
effectively.
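
One possible shape for such an API is sketched below with Python's standard `http.server` module. The `/predict` path, the JSON field names, the port, and the placeholder `score_url` function are all hypothetical. The project's real endpoint and payload format are not described in this chapter.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score_url(url):
    """Placeholder for the trained detection model; returns a fake probability."""
    return 0.9 if "@" in url or url.startswith("http://") else 0.1


def handle_predict(raw_body):
    """Turn a JSON request body into (status_code, response_dict)."""
    try:
        url = json.loads(raw_body)["url"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'url' field"}
    if not isinstance(url, str):
        return 400, {"error": "'url' must be a string"}
    probability = score_url(url)
    return 200, {"url": url, "phishing": probability >= 0.5, "score": probability}


class PredictHandler(BaseHTTPRequestHandler):
    """Receives URLs from the extension's background script and returns a verdict."""

    def do_POST(self):
        if self.path != "/predict":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the sketch server on localhost; call this manually to try it out."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle_predict`, separate from the HTTP handler, lets the evaluation path be exercised without starting a server.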

At the core of the system lies the Phishing Detection Model, serving as the
intelligence hub. This model is likely a machine learning model meticulously trained
to discern patterns and features characteristic of phishing websites. It plays a pivotal
role in the system's functionality, leveraging its learned knowledge to analyse
incoming data and determine whether a visited webpage poses a potential phishing
threat. In essence, the web application operates as a seamlessly integrated unit, with
the frontend facilitating user interaction through the Chrome Extension and its
content script, while the backend, with its background script, API, and the Phishing
Detection Model, ensures robust processing and accurate identification of potential
phishing websites.

Figure 4.3 Backend-based architecture diagram

A stacked ensemble learning model is a machine learning technique that combines the
predictions of multiple models to improve on the performance of any single model.
The model shown in this diagram is a multilayer stacked ensemble, that is, a stacked
ensemble with more than one layer of models. The first layer consists of five different
machine learning models: MLP, KNN, RF, LR, and XGB, all trained on the training dataset.
The second layer consists of two models, which are trained on the outputs of the
first-layer models. The third layer consists of a single XGB model, also called the meta
learner, which is trained on the outputs of the second-layer models. The final output is
a prediction of whether a website is phishing or legitimate.
By combining the strengths of multiple models, the stacked ensemble can outperform a
single machine learning model.

4.4 PROJECT REQUIREMENTS

This section details the technological prerequisites for implementing the project. It
outlines the software and hardware required for its successful execution.

4.4.1 SOFTWARE REQUIREMENTS

● Operating System: Windows

● Tools Used: Jupyter, Google Colab, Visual Studio.

4.4.2 HARDWARE REQUIREMENTS

● Processor: AMD Ryzen 7 5800H

● Hard Disk : 500GB SSD
● RAM : 16GB SODIMM RAM
● GPU: NVIDIA GeForce RTX 3050

Jupyter is an open-source tool that allows interactive computing and supports
various programming languages. It is widely used for creating and sharing documents
containing live code, equations, visualizations, and narrative text. Jupyter suits
developing and testing code, especially for tasks like data preprocessing, feature
extraction, and initial model training. Its interactive nature facilitates iterative
development and experimentation.

Google Colab is a cloud-based platform provided by Google that allows for the
creation and execution of Jupyter notebooks in a collaborative environment. It
provides free access to GPU resources, which is beneficial for training machine
learning models. Its cloud-based infrastructure is suited to resource-intensive tasks,
such as training machine learning models on the dataset.

In summary, the specified software and hardware requirements are optimised to be
used with Windows as the required OS, with Jupyter and Google Colab as key
development tools. The hardware specifications include a high-performance
processor, SSD for fast storage, ample RAM for efficient multitasking, and a
dedicated GPU for accelerated machine learning tasks. These choices are aligned
with the computational demands of developing and implementing a phishing URL
detection system with machine learning models.

CHAPTER 5

IMPLEMENTATIONS

Chapter 5 describes the steps taken to build and deploy the phishing URL detection
website. It covers the development and deployment of its core security features and
details their construction, providing a blueprint for those seeking to build their own
phishing URL detection system.

5.1 ANALYSIS ON THE DATA

#Loading data into data frame
import pandas as pd

data = pd.read_csv("phishing.csv")
data.head()

Figure 5.1 Data

#Shape of data frame

data.shape

Figure 5.2 Shape of the data

#Listing the features of the dataset
data.columns

Figure 5.3 Features of the data

#Information about the dataset

data.info()

Figure 5.4 Information of the data

#Number of unique values in each column
data.nunique()

Figure 5.5 Unique values of the data

#Dropping the index column

data = data.drop(['Index'],axis = 1)

#description of dataset
data.describe().T

Figure 5.6 Description of the data

• There are 11054 instances and 31 features in the dataset.
• Of these, 30 are independent features and 1 is the dependent feature.
• There are no outliers in the dataset.
• There are no missing values in the dataset.

5.2 DATA VISUALIZATION

#Correlation heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15,15))
sns.heatmap(data.corr(), annot=True)
plt.show()

Figure 5.7 Correlation map of the data

This code uses the seaborn (sns) and matplotlib (plt) libraries to draw a heatmap of the
correlation matrix of the DataFrame data. Each cell is coloured by the correlation
coefficient of a pair of columns: with seaborn's default colormap, lighter cells indicate
higher (more positive) coefficients and darker cells indicate lower (more negative) ones.
Because annot=True is set, each cell also shows the exact coefficient.
#Phishing Count in a pie chart
data['class'].value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.title("Phishing Count")
plt.show()

Figure 5.8 Pie chart for classes of the data

This code generates a pie chart showing the distribution of the 'class' column of the
DataFrame 'data'. Each slice represents one class, and its size corresponds to the
frequency of that class in the dataset. The percentages displayed on the chart give the
proportion of each class relative to the total number of instances.

5.3 MODEL BUILDING AND TRAINING

# Splitting the dataset into dependent and independent features
X = data.drop(["class"],axis =1)
y = data["class"]

# Splitting the dataset into train and test sets: 80-20 split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

Figure 5.9 Testing and Training Data
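The sizes of the two subsets follow from simple arithmetic. The sketch below assumes the 11054 instances reported in Section 5.1 and assumes the rounding used by scikit-learn's train_test_split for a fractional test size (test set rounded up, training set taking the remainder); the `split_sizes` helper is hypothetical and only reproduces that arithmetic.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set is rounded up and the training set
    takes the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(11054, 0.2)
print(n_train, n_test)
```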

#Defining Fisher’s Score
import numpy as np

def fisher_score(X, y):
    """
    Compute the Fisher Score for each feature.

    Parameters:
    - X: array-like, shape (n_samples, n_features)
      Feature matrix.
    - y: array-like, shape (n_samples,)
      Target vector.

    Returns:
    - scores: numpy array, shape (n_features,)
      Fisher Scores for each feature.
    """
    # Work on plain numpy arrays so DataFrame/Series inputs are also accepted
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Number of samples for each class
    classes, class_counts = np.unique(y, return_counts=True)
    n_classes = len(classes)

    # Overall mean of each feature
    mean_overall = np.mean(X, axis=0)

    # Within and between class scatter
    S_W = np.zeros(X.shape[1])
    S_B = np.zeros(X.shape[1])

    for i, label in enumerate(classes):
        X_class = X[y == label]
        mean_class = np.mean(X_class, axis=0)
        S_W += ((X_class - mean_class)**2).sum(axis=0)
        S_B += class_counts[i] * ((mean_class - mean_overall)**2)

    # Compute Fisher Score (a constant feature has S_W = 0 and yields inf or nan)
    scores = S_B / S_W
    return scores

# Example usage
# Assuming X is your feature matrix and y is your target vector
fisher_scores = fisher_score(X, y)
# If you want to rank features based on Fisher Score
ranking = np.argsort(fisher_scores)[::-1]
print("Features ranked by Fisher Score:")
for rank in ranking:
    print(f"Feature {rank} Score: {fisher_scores[rank]}")

Figure 5.10 Fisher’s score for the features in the data
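To make the score concrete, the following sketch computes it for a single feature in plain Python on two made-up features: one whose class means are far apart and one whose class means coincide. The `fisher_score_1d` helper and the toy values are assumptions for illustration and are not taken from the phishing dataset.

```python
def fisher_score_1d(values, labels):
    """Fisher score of one feature: between-class scatter / within-class scatter."""
    overall_mean = sum(values) / len(values)
    s_b = 0.0  # between-class scatter
    s_w = 0.0  # within-class scatter
    for label in set(labels):
        group = [v for v, l in zip(values, labels) if l == label]
        group_mean = sum(group) / len(group)
        s_b += len(group) * (group_mean - overall_mean) ** 2
        s_w += sum((v - group_mean) ** 2 for v in group)
    return s_b / s_w if s_w > 0 else float("inf")


labels = [-1, -1, -1, 1, 1, 1]
separating = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]   # class means far apart
overlapping = [0.0, 1.0, 0.5, 0.1, 0.9, 0.5]  # class means identical

print(fisher_score_1d(separating, labels))
print(fisher_score_1d(overlapping, labels))
```

The first feature receives a large score because its classes differ much more between groups than within them, while the second scores close to zero. Ranking features by this ratio is what allows the less discriminative ones to be dropped.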

#Fitting the models
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

def fit_models(models, X, y):
    for model in models:
        model.fit(X, y)

def generate_meta_features(models, X):
    # Stack the class probabilities predicted by every model side by side
    meta_features = [model.predict_proba(X) for model in models]
    return np.hstack(meta_features)

# Mapping -1 to 0 and 1 remains 1
y_train = y_train.map({-1: 0, 1: 1})
y_test = y_test.map({-1: 0, 1: 1})

# Initialize a label encoder and fit it to the training labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)

#First layer models
models_layer1 = [
    xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss'),
    MLPClassifier(max_iter=1000, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(max_iter=1000, random_state=42)
]
# First layer training and meta-features generation
fit_models(models_layer1, X_train, y_train)
X_train_meta_1 = generate_meta_features(models_layer1, X_train)
X_test_meta_1 = generate_meta_features(models_layer1, X_test)

# Define the second layer models
models_layer2 = [
    MLPClassifier(max_iter=1000, random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
]
# Train second layer models using the meta-features from the first layer
fit_models(models_layer2, X_train_meta_1, y_train)
# Generate second layer meta-features
X_train_meta_2 = generate_meta_features(models_layer2, X_train_meta_1)
X_test_meta_2 = generate_meta_features(models_layer2, X_test_meta_1)

#Third layer
final_layer_model = xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
# Third layer training and final predictions
final_layer_model.fit(X_train_meta_2, y_train)
y_pred_final = final_layer_model.predict(X_test_meta_2)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_final)
print(f'Final Model Accuracy: {accuracy}')

Figure 5.11 Final accuracy of the model
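The listing above builds each layer's training meta-features from models that were fitted on the same rows. A common variant, sketched here as an alternative and not as the method used in this project, generates the training meta-features out of fold, so that each row is scored by a model that did not see it during fitting. The synthetic dataset, the two base models, and the logistic-regression meta learner are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the phishing dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base_models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold probabilities: each training row is scored by a model
# that did not see that row during fitting.
train_meta = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Refit every base model on the full training set before scoring test data.
test_meta = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])

meta_learner = LogisticRegression(max_iter=1000).fit(train_meta, y_train)
accuracy = accuracy_score(y_test, meta_learner.predict(test_meta))
print(f"Stacked accuracy with out-of-fold meta-features: {accuracy:.3f}")
```

The design choice is about what the upper layer learns from: in-sample probabilities tend to look more confident than the base models really are on unseen data, whereas out-of-fold probabilities resemble what the meta learner will receive at prediction time.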

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_final)
print("Confusion Matrix:")
print(conf_matrix)

Figure 5.12 Confusion matrix for the model

# Generate classification report
class_report = classification_report(y_test, y_pred_final)
print("Classification Report:")
print(class_report)

Figure 5.13 Classification report of the model

CHAPTER 6
EXPERIMENTATION RESULTS

Chapter 6 serves as a reflection on the essential lessons derived from our research
journey, highlighting the quest for improvement and innovation in ensemble learning
methodologies.

6.1 OBSERVATIONS

This chapter compares two machine learning approaches, the Multilayer Stacked Ensemble
Learning Machine (MLSELM) and XGBoost, on the task of phishing website detection. The
MLSELM model, comprising three layers of classifiers, outperformed the XGBoost model in
terms of accuracy. With feature selection applied, MLSELM achieved an accuracy of 97%,
while the XGBoost model attained 86%.

We evaluate the proposed MLSELM, whose layers combine Multi-Layer Perceptron (MLP),
k-nearest neighbours (KNN), Random Forest (RF), Logistic Regression (LR), and XGBoost
(XGB) classifiers, against a standalone XGBoost model. The classification metrics used
for performance evaluation are Precision, Recall, F-score, and Accuracy. In
distinguishing between Legitimate and Phishing instances, Phishing instances are
designated as positive and Legitimate instances as negative.

The counts of True Positives (NTP), True Negatives (NTN), False Positives (NFP),
and False Negatives (NFN) are defined as follows:
- P: Total number of phishing instances
- N: Total number of legitimate instances
- NTN: Number of legitimate instances predicted as legitimate
- NFN: Number of phishing instances predicted as legitimate
- NTP: Number of phishing instances predicted as phishing
- NFP: Number of legitimate instances predicted as phishing

The computation of each metric is articulated as follows:
• Accuracy: Accuracy is the proportion of correctly classified cases (true positives
and true negatives) out of the total number of cases examined.
((NTP + NTN) / (P + N)) × 100
• Precision: Precision is the proportion of true positives out of the total number
of cases predicted as positive.
(NTP / (NTP + NFP)) × 100
• Recall: Recall is the proportion of true positives out of the total number of
positive cases in the dataset.
(NTP / (NTP + NFN)) × 100
• F-score: The F-score combines precision and recall into a single metric, their
harmonic mean. With Precision and Recall expressed as percentages:
(2 × Precision × Recall) / (Precision + Recall)
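The four metrics can be computed directly from the confusion-matrix counts, with the F-score taken as the harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall). The sketch below is illustrative: the `classification_metrics` helper is hypothetical and the counts passed to it are made up, not results from this project.

```python
def classification_metrics(n_tp: int, n_tn: int, n_fp: int, n_fn: int) -> dict:
    """Return accuracy, precision, recall and F-score as percentages."""
    total = n_tp + n_tn + n_fp + n_fn  # equals P + N
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": 100 * (n_tp + n_tn) / total,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f_score": 100 * f_score,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(n_tp=90, n_tn=95, n_fp=5, n_fn=10)
print(metrics)
```

The guards against empty denominators matter in practice: a model that never predicts the positive class has no defined precision, and returning 0 keeps the report usable instead of raising an error.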

The experimental setup involved training and evaluating the MLSELM and XGBoost models
on the same dataset of features relevant to phishing website detection, so that both
are assessed under identical conditions. Both models underwent feature selection using
Fisher’s Score, which also gives insight into the impact of this preprocessing step on
model performance. The comparison highlights the relative strengths and weaknesses of
each approach and aids in selecting the most suitable algorithm for phishing website
detection.

MEASURES (%)   XGBOOST   MLSELM

ACCURACY       86        97
PRECISION      93        97
RECALL         79        96
F1 SCORE       85        96

Figure 6.1 The performance of XGBoost algorithm & MLSELM algorithm

Based on the results in the table, the MLSELM model outperforms the XGBoost model on
all four metrics. Its accuracy is considerably higher (97% compared to 86%), meaning it
correctly classified a larger proportion of websites. MLSELM also has higher precision
(97% compared to 93%) and recall (96% compared to 79%): fewer of the websites it flagged
were actually legitimate, and it identified a larger proportion of the actual phishing
sites. Finally, it has a higher F1 score (96% compared to 85%), reflecting a better
overall balance between precision and recall.

It is important to consider that the performance of these models may vary depending
on the specific dataset they are trained on and the types of phishing websites they
encounter.

6.2 INFERENCES

The superior performance of the MLSELM model can be attributed to its multilayer
stacked ensemble architecture, which leverages the collective intelligence of multiple
classifiers to make accurate predictions. By incorporating diverse base classifiers
and meta-learning techniques, MLSELM effectively captures the complex
relationships between features and the target variable, enhancing its discriminative
power. In contrast, while XGBoost is renowned for its scalability and efficiency, its
performance may be limited by its single-layer ensemble approach, which may
struggle to capture intricate patterns in the data.

Furthermore, the success of the MLSELM model emphasizes the importance of feature
selection in enhancing model performance. By identifying and prioritizing relevant
features and optimizing model parameters, we can mitigate the risk of overfitting and
improve the model's generalization capabilities.

CHAPTER 7

FUTURE WORK AND ENHANCEMENTS

In envisioning the future enhancements for our phishing website detection model, we
aim to give it self-learning and self-updating capabilities. By leveraging advanced
machine learning algorithms, the model would adapt autonomously to emerging threats,
continuously refining its predictive capabilities without the need for manual
intervention. This approach would provide up-to-date protection while relieving
administrators of the tedious task of manual updates.

Moreover, we aspire to expand the reach of our phishing detection solution beyond a
single browser. Through future enhancements to our Chrome extension, we aim to build a
tool that integrates with other popular web browsers, giving users across diverse
platforms the means to safeguard themselves against phishing attacks.

In essence, these future enhancements are intended to make protection against the
ever-evolving threat of phishing attacks intelligent, adaptive, and widely accessible.

APPENDIX

A1 - SOURCE CODE

Base.html
<!DOCTYPE html>
<html lang="en">

<head>
  <link rel="preconnect" href="https://fonts.googleapis.com" />

  <title>{% block title %}{% endblock %}</title>
</head>
{% set theme = theme|default('dark') -%}

<body class="body--{{ theme }}">
  <main>{% block content %}{% endblock %}</main>
  <div class="footer">{% block footer %}{% endblock %}</div>

  {% block scripts %}
  <script src="{{ url_for('static', filename='js/jquery.min.js') }}"></script>
  <script src="{{ url_for('static', filename='js/bootstrap.bundle.min.js') }}"></script>
  <script src="https://cdn.jsdelivr.net/npm/sweetalert2@11"></script>
  {% endblock %}
</body>

</html>

Check.html
{% extends 'base.html' %}
{% block title %}Processing | Phishing Website Detector{% endblock %}

{% set theme = 'dark' -%}

{% block content %}
<div class="card card--result color-bg-dark">
  <img class="screenshot--target card-img rounded-0" alt="{{target}}" style="display: none;">
  <div class="screenshot--skeleton placeholder-glow">
    <span class="placeholder"></span>
  </div>

  <div class="card-img-overlay content--wrapper">
    <div class="content--area placeholder-glow">
      <div class="liquid-ball placeholder">
        <div class="ball-inner">
          <div class="ball-percent"></div>
          <div class="ball-water"></div>
          <div class="ball-glare"></div>
        </div>
      </div>
      <button id="web-button" class="btn btn-lg btn-redirect placeholder px-lg-5">Continue to
        website</button>
    </div>
  </div>
</div>
{% endblock %}

{% block scripts %}
{{ super() }}
<script>
  function setTargetScreenshot() {
    var targetScreenshot = $(".screenshot--target");
    var skeletonScreenshot = $(".screenshot--skeleton");

    targetScreenshot.hide();
    skeletonScreenshot.show();

    var height = $(window).height();
    var width = $(window).width();

    targetScreenshot.attr("src", `{{ url_for('screenshot', target=target) }}&width=${width}&height=${height}`);
    targetScreenshot.on('load', function () {
      skeletonScreenshot.hide();
      $(this).show();
    });
  }

  function showResult(data) {
    var variantInc = 100 / 3;
    var safe_percentage = data.safe_percentage;
    var percentage = Math.max(data.safe_percentage, data.unsafe_percentage);

    if (percentage !== "" && !isNaN(percentage) && percentage <= 100 && percentage >= 0) {
      var waterLevel = 100 - percentage;

      $(".ball-percent").append($("<span>").text(percentage.toLocaleString('en-US', {
        minimumFractionDigits: 0, maximumFractionDigits: 1,
      }) + "%"));
      $(".ball-water").css("top", waterLevel + "%");

      if (safe_percentage < variantInc * 1) {
        $(".content--area").addClass("content--unsafe");
        $(".ball-percent").append($("<span>").text("unsafe"));
        $(".btn-redirect").attr("onClick", "phishedAlert()");
      } else if (safe_percentage < variantInc * 2) {
        $(".content--area").addClass("content--doubt");
        $(".ball-percent").append($("<span>").text("doubt"));
        $(".btn-redirect").attr("onClick", `warningAlert('${data.target}')`);
      } else {
        $(".content--area").addClass("content--safe");
        $(".ball-percent").append($("<span>").text("safe"));
        $(".btn-redirect").attr("onClick", `safeAlert('${data.target}')`);
      }
    } else {
      $(".ball-water").css("top", "100%");
      $(".ball-percent").text("NaN").css("font-size", "92px");
      $(".content--area").addClass("content--doubt");
    }

    $(".content--area").find(".placeholder").removeClass("placeholder");
    $(".content--area").removeClass("placeholder-glow");
  }

  function phishedAlert() {
    Swal.fire({
      title: "Phished Website!!",
      text: "It's too dangerous to continue, hence we can't allow this action.",
      showCancelButton: true,
      showConfirmButton: false,
      showDenyButton: true,
      denyButtonText: 'Back to home'
    }).then((result) => {
      if (result.isDenied) {
        window.location = "{{ url_for('home') }}";
      }
    });
  }

  function warningAlert(target) {
    Swal.fire({
      title: "Seems unsafe to me!!",
      text: "It's too dangerous to continue, hence we can't allow this action.",
      showCancelButton: true,
      showConfirmButton: true,
      confirmButtonText: 'Continue anyways'
    }).then((result) => {
      // Open the target only when the user explicitly confirms
      if (result.isConfirmed) {
        window.open(target, "_blank");
      }
    });
  }

  function safeAlert(target) {
    Swal.fire({
      title: "Hurray! You're safe",
      html: "You'll be redirected to the website...",
      timer: 3000,
      timerProgressBar: true
    }).then((result) => {
      if (result.dismiss === Swal.DismissReason.timer) {
        window.open(target, "_blank");
      }
    });
  }

  function invalidAlert(error) {
    Swal.fire({
      icon: 'error',
      title: 'Oops! Something went wrong!',
      text: error,
      input: 'url',
      showDenyButton: true,
      denyButtonText: 'Back to home',
      confirmButtonText: 'Check again',
      inputPlaceholder: 'Enter the URL',
      allowOutsideClick: false
    }).then((result) => {
      if (result.isConfirmed) {
        window.location = `{{ url_for('check') }}?target=${result.value}`;
      } else if (result.isDenied) {
        window.location = "{{ url_for('home') }}";
      }
    });
  }

  $(function () {
    setTargetScreenshot();

    $.ajax({
      type: "POST",
      url: "{{ url_for('check') }}",
      contentType: "application/json",
      data: JSON.stringify({
        target: "{{target}}",
      }),
      dataType: "json",
      success: function (data, status) {
        console.log("Data: " + JSON.stringify(data) + "\nStatus: " + status);
        if (data.status) {
          showResult(data);
        } else {
          showResult("");
          invalidAlert(data.message);
        }
      },
    });
  });

  $(window).on('resize', function () {
    setTargetScreenshot();
  });
</script>
{% endblock %}

Home.html
<!DOCTYPE html>

<html lang="en" dir="ltr">

<head>
<meta charset="utf-8">
<title>Phishing Detector</title>
<link rel="shortcut icon" href="{{ url_for('static', filename='spam-favicon.ico') }}">
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='styles.css') }}">
<script src="https://kit.fontawesome.com/5f3f547070.js" crossorigin="anonymous"></script>
<link href="https://fonts.googleapis.com/css2?family=Roboto&display=swap" rel="stylesheet">

</head>

<body>

<header>
<a href="/url-detector" class="link">Phishing Url Detector</a>
</header>

<div class="container">
<h1 class='container-heading'><span>Secure Your Organization from
Phishing Attacks</span></h1>
<div class='description'>
<p>Advanced phishing detection app that utilizes cutting-edge algorithms and
machine learning techniques to identify and prevent phishing attacks targeting
organizations.</p>
</div>
</div>

<!-- <div class='footer'>

</div> -->

</body>
</html>

Index.html
{% extends 'base.html' %}
{% block title %}Phishing Detector by Invaders{% endblock %}

{% block content %}
<section class="container min-vh-100">
<header>
<a href="/" class="link">Home Page</a>
</header>

<div class="row align-items-center justify-content-center min-vh-100 py-5">

<div class="col-lg-10 my-5">

<div class="d-flex justify-content-center mb-3">
<figure class="text-center">
<blockquote class="blockquote">
<h1 class="fw-bold">Phishing Website Detector</h1>
</blockquote>

</figure>
</div>
<form action="{{ url_for('check') }}" method="get" class="form--home input-group rounded">
<input type="url" id="target-url" name="target" class="form-control form-control-lg text-center border border-dark border-2 py-3" placeholder="http://phish-site.com/malicious-url" required />

<button class="btn btn-dark fs-5 py-3 px-5" type="submit">
Let's find out
</button>
</form>
</div>
<div class="col-10 my-5">
<div class="row text-center">
<div class="col-md py-2">
<div class="card shadow-sm p-4 border border-dark border-2">
<div id="web-visits" class="display-5 fw-bold text-primary"></div>
<p class="fs-5 fw-semibold">Total website visits</p>
</div>
</div>
<div class="col py-2">
<div class="card shadow-sm p-4 border border-dark border-2">
<div id="web-checked" class="display-5 fw-bold text-secondary"></div>
<p class="fs-5 fw-semibold">Total website checked</p>
</div>
</div>
<div class="col py-2">
<div class="card shadow-sm p-4 border border-dark border-2">
<div id="web-phished" class="display-5 fw-bold text-danger"></div>
<p class="fs-5 fw-semibold">Total phished websites</p>
</div>
</div>
</div>
</div>
</div>
</section>
{% endblock %}

{% block scripts %}
{{ super() }}
<script>
  var eventSource = new EventSource("{{ url_for('listen') }}");

  eventSource.addEventListener(
    "stats",
    function (e) {
      console.log(e.data);
      var data = JSON.parse(e.data);
      $("#web-visits").text(data.visits);
      $("#web-checked").text(data.checked);
      $("#web-phished").text(data.phished);
    },
    true
  );
</script>
{% endblock %}

Result.html
<!DOCTYPE html>

<html lang="en" dir="ltr">

<head>
<meta charset="utf-8" />
<title>PHISHING DETECTOR</title>
<link rel="shortcut icon" href="{{ url_for('static', filename='spam-favicon.ico') }}" />
<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='styles.css') }}" />
<script src="https://kit.fontawesome.com/5f3f547070.js" crossorigin="anonymous"></script>
<link href="https://fonts.googleapis.com/css2?family=Roboto&display=swap" rel="stylesheet" />
</head>

<body>
<div class="results">
<h1>PREDICTION RESULT</h1>
{% if prediction==1 %}
<h2>
<span class="danger">Caution! Our system has flagged this message as a possible phishing attempt</span>
</h2>
<img class="image" src="{{ url_for('static', filename='unsafe-icon.png') }}" alt="SPAM Image" />
{% elif prediction==0 %}
<h2>
<span class="safe">Congratulations! This message is classified as SAFE</span>
</h2>
<img class="image" src="{{ url_for('static', filename='safety-icon.png') }}" alt="Not a spam image" />
{% endif %}
</div>
</body>
</html>

Content detector.py

# Importing essential libraries
from flask import Flask, render_template, request
import pickle

# Load the Multinomial Naive Bayes model and CountVectorizer object from disk
filename = 'spam-sms-mnb-model.pkl'
classifier = pickle.load(open(filename, 'rb'))
cv = pickle.load(open('cv-transform.pkl', 'rb'))
app = Flask(__name__)


@app.route('/')
def home():
    return render_template('home.html')


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)


if __name__ == '__main__':
    app.run(debug=True)
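
The pickled `classifier` and `cv` objects are opaque in this listing. As a rough illustration of what a bag-of-words vectorizer followed by a Multinomial Naive Bayes classifier computes, the sketch below re-implements the idea in plain Python. The function names (`tokenize`, `train_nb`, `predict_nb`), the whitespace tokenizer, and the four training messages are assumptions made for illustration only; the project itself uses the scikit-learn objects loaded from disk.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def train_nb(messages, labels):
    """Collect the vocabulary, per-class word counts and per-class message counts."""
    vocab = {word for msg in messages for word in tokenize(msg)}
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for msg, label in zip(messages, labels):
        word_counts[label].update(tokenize(msg))
    return vocab, word_counts, class_counts


def predict_nb(model, message):
    """Return the class with the highest log-probability for the message."""
    vocab, word_counts, class_counts = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for word in tokenize(message):
            if word in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): 1 = phishing, 0 = safe.
train_messages = [
    "verify your account now",
    "urgent verify password",
    "lunch at noon",
    "see you at lunch",
]
train_labels = [1, 1, 0, 0]
toy_model = train_nb(train_messages, train_labels)
```

Calling `predict_nb(toy_model, "verify your password")` plays the role of `classifier.predict(cv.transform([message]))` in the route above: the message is reduced to word counts, and the class whose word distribution explains those counts best is returned.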

Features.py
# Extraction of features from the URL
# 0 having_IP_Address
# 1 URL_Length
# 2 Shortining_Service
# 3 having_At_Symbol
# 4 double_slash_redirecting
# 5 Prefix_Suffix
# 6 having_Sub_Domain
# 7 URL_Depth
# 8 Domain_registeration_length
# 9 Favicon
# 10 port
# 11 HTTPS_token
# 12 Request_URL
# 13 URL_of_Anchor
# 14 Links_in_tags
# 15 SFH
# 16 Submitting_to_email
# 17 Abnormal_URL
# 18 Redirect
# 19 on_mouseover
# 20 RightClick
# 21 popUpWidnow
# 22 Iframe
# 23 age_of_domain
# 24 DNSRecord
# 25 web_traffic
# The feature functions above return
# 1 if the URL is Phishing,
# -1 if the URL is Legitimate and
# 0 if the URL is Suspicious

import re
import whois
import datetime
import requests
import ipaddress
from dns import resolver
from bs4 import BeautifulSoup
from urllib.parse import urlparse


class FeatureExtraction:
    def __init__(self, url):
        self.url = url
        self.parsedurl = urlparse(self.url)
        self.domain = self.parsedurl.netloc
        try:
            self.whois = whois.whois(self.domain)
        except:
            self.whois = None
        try:
            self.request = requests.get(self.url, timeout=5, headers={
                "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) "
                              "AppleWebKit/537.36 (KHTML, like Gecko) "
                              "Chrome/81.0.4044.141 Safari/537.36"})
            self.soup = BeautifulSoup(self.request.content, 'html.parser')
        except:
            self.request = None
            self.soup = None

    # Known URL-shortening domains, used by Shortining_Service. Only this
    # fragment of the pattern survives in the listing.
    shortening_services = r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co"

    def getFeaturesDict(self):
        return {
            "having_IP_Address": self.having_IP_Address(),
            "URL_Length": self.URL_Length(),
            "Shortining_Service": self.Shortining_Service(),
            "having_At_Symbol": self.having_At_Symbol(),
            "double_slash_redirecting": self.double_slash_redirecting(),
            "Prefix_Suffix": self.Prefix_Suffix(),
            "having_Sub_Domain": self.having_Sub_Domain(),
            "URL_Depth": self.URL_Depth(),
            "Domain_registeration_length": self.Domain_registeration_length(),
            "Favicon": self.Favicon(),
            "port": self.port(),
            "HTTPS_token": self.HTTPS_token(),
            "Request_URL": self.Request_URL(),
            "URL_of_Anchor": self.URL_of_Anchor(),
            "Links_in_tags": self.Links_in_tags(),
            "SFH": self.SFH(),
            "Submitting_to_email": self.Submitting_to_email(),
            "Abnormal_URL": self.Abnormal_URL(),
            "Redirect": self.Redirect(),
            "on_mouseover": self.on_mouseover(),
            "RightClick": self.RightClick(),
            "popUpWidnow": self.popUpWidnow(),
            "Iframe": self.Iframe(),
            "age_of_domain": self.age_of_domain(),
            "DNSRecord": self.DNSRecord(),
            "web_traffic": self.web_traffic()
        }

    """#### IP Address in the URL

    Checks for the presence of an IP address in the URL. A URL may use an IP address
    in place of a domain name, which is a strong sign that the page is trying to
    steal personal information.
    If the domain part of the URL is an IP address, the value assigned to this feature
    is 1 (phishing) or else -1 (legitimate).
    """

    def having_IP_Address(self):
        try:
            ipaddress.ip_address(self.domain)
            return 1
        except:
            return -1

    """#### Length of URL

    Computes the length of the URL. Phishers can use a long URL to hide the doubtful
    part in the address bar.
    If the length of the URL is less than 54 characters, the value assigned to this
    feature is -1 (legitimate); if it is between 54 and 75 characters, 0 (suspicious);
    otherwise 1 (phishing).
    """

    def URL_Length(self):
        if len(self.url) < 54:
            return -1
        elif len(self.url) >= 54 and len(self.url) <= 75:
            return 0
        else:
            return 1

    """#### ** Using URL Shortening Services “TinyURL” **
    URL shortening is a method on the “World Wide Web” in which a URL may be
    made considerably smaller in length and still lead to the required webpage. This is
    accomplished by means of an “HTTP Redirect” on a domain name that is short,
    which links to the webpage that has a long URL.
    If the URL is using Shortening Services, the value assigned to this feature is 1
    (phishing) or else -1 (legitimate).
    """

    def Shortining_Service(self):
        if re.search(self.shortening_services, self.url):
            return 1
        else:
            return -1

    """#### "@" Symbol in URL

    Checks for the presence of '@' symbol in the URL. Using “@” symbol in the URL
    leads the browser to ignore everything preceding the “@” symbol and the real
    address often follows the “@” symbol.
    If the URL has '@' symbol, the value assigned to this feature is 1 (phishing) or else
    -1 (legitimate).
    """

    def having_At_Symbol(self):
        if '@' in self.url:
            return 1
        else:
            return -1

    """#### Redirection "//" in URL

    Checks the presence of "//" in the URL. The existence of “//” within the URL path
    means that the user will be redirected to another website. The location of the “//” in
    URL is computed.
    If the "//" is anywhere in the URL apart from after the protocol, the value assigned
    to this feature is 1 (phishing) or else -1 (legitimate).
    """

    def double_slash_redirecting(self):
        if re.search(r'https?://[^\s]*//', self.url):
            return 1
        else:
            return -1

    """#### Prefix or Suffix "-" in Domain

    Checking the presence of '-' in the domain part of URL. The dash symbol is rarely
    used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to
    the domain name so that users feel that they are dealing with a legitimate webpage.
    If the URL has '-' symbol in the domain part of the URL, the value assigned to this
    feature is 1 (phishing) or else -1 (legitimate).
    """

    def Prefix_Suffix(self):
        if '-' in self.domain:
            return 1
        else:
            return -1

    """#### ** SubDomains **
    Counts the dots in the domain part of the URL. If there are at most 2 dots, the
    value assigned to this feature is -1 (legitimate); if there are exactly 3, 0
    (suspicious); otherwise 1 (phishing).
    """

    def having_Sub_Domain(self):
        count = self.domain.count('.')
        if count <= 2:
            return -1
        elif count > 2 and count <= 3:
            return 0
        else:
            return 1

    """#### Depth of URL

    Computes the depth of the URL. This feature calculates the number of sub pages
    in the given URL based on the '/'.
    The value of this feature is a non-negative count rather than one of -1, 0 or 1.
    """

    def URL_Depth(self):
        depth = 0
        subdirs = self.parsedurl.path.split('/')
        for subdir in subdirs:
            if subdir:
                depth += 1
        return depth

    """#### End Period of Domain

    This feature can be extracted from the WHOIS database. For this feature, the
    remaining domain time is calculated by finding the difference between expiration time
    & current time. The end period considered for the legitimate domain is 6 months or
    more for this project.
    If end period of domain < 6 months, the value of this feature is 1 (phishing) else -1
    (legitimate). If no WHOIS record is available, the value is also 1.
    """

    def Domain_registeration_length(self):
        if self.whois is None:
            return 1
        try:
            if type(self.whois['expiration_date']) is list:
                expiration_date = self.whois['expiration_date'][0]
            else:
                expiration_date = self.whois['expiration_date']

            registration_length = abs(
                (expiration_date - datetime.datetime.now()).days)
            if registration_length / 30 >= 6:
                return -1
            else:
                return 1
        except:
            return 1

    """#### ** Favicon **
    Checks for the presence of a favicon in the website. The presence of a favicon in the
    website can be used as a feature to detect phishing websites.
    If the website has a favicon, the value assigned to this feature is -1 (legitimate)
    or else 1 (phishing).
    """

    def Favicon(self):
        try:
            if re.findall(r'favicon', self.soup.text) or \
                    self.soup.find('link', rel='shortcut icon') or \
                    self.soup.find('link', rel='icon'):
                return -1
            else:
                return 1
        except:
            return 1

    """#### Non-Standard Port

    Checks for the use of a non-standard port. Phishers often use non-standard ports in
    the URL in order to make it look like a legitimate one. This check treats any
    explicit port in the URL as non-standard.
    If the URL names an explicit port, the value assigned to this feature is 1
    (phishing) or else -1 (legitimate).
    """

    def port(self):
        if self.parsedurl.port:
            return 1
        else:
            return -1

    """#### "https" in Domain name

    Checks for the presence of the "https" token in the domain part of the URL. The
    phishers may add the “HTTPS” token to the domain part of a URL in order to
    trick users.
    If the URL has "https" in the domain part, the value assigned to this feature is
    1 (phishing) or else -1 (legitimate).
    """

    def HTTPS_token(self):
        if 'https' in self.domain:
            return 1
        else:
            return -1

    """### ** Request_URL **
    The fine line that distinguishes phishing websites from legitimate ones is how
    many times a website has been redirected. In our dataset, we find that legitimate
    websites have been redirected one time max. On the other hand, phishing websites
    containing this feature have been redirected at least 4 times.
    If the request was redirected at most once, the value assigned to this feature is
    -1 (legitimate); if two or three times, 0 (suspicious); otherwise 1 (phishing).
    """

    def Request_URL(self):
        try:
            if len(self.request.history) <= 1:
                return -1
            elif len(self.request.history) <= 3:
                return 0
            else:
                return 1
        except:
            return -1

    """#### ** URL_of_Anchor **
    This feature checks the page for “<a>” HTML tags that carry an href attribute.
    If the page has no such tag, or the page could not be fetched, the value assigned
    to this feature is 1 (phishing) or else -1 (legitimate).
    """

    def URL_of_Anchor(self):
        try:
            count = 0
            for i in self.soup.find_all('a'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except:
            return 1

    """#### ** Links_in_tags **
    This feature checks the page for “<link>” HTML tags that carry an href attribute.
    If the page has no such tag, or the page could not be fetched, the value assigned
    to this feature is 1 (phishing) or else -1 (legitimate).
    """

    def Links_in_tags(self):
        try:
            count = 0
            for i in self.soup.find_all('link'):
                if i.has_attr('href'):
                    count += 1
            if count == 0:
                return 1
            else:
                return -1
        except:
            return 1

    """#### ** SFH **
    The presence of a “<form>” HTML tag in the page is treated as an indicator of
    phishing websites. This feature checks for the presence of a “<form>” tag in the page.
    If the page has a “<form>” tag, the value assigned to this feature is 1 (phishing) or
    else -1 (legitimate). If the page could not be fetched, the value is 0 (suspicious).
    """

    def SFH(self):
        try:
            if self.soup.find('form'):
                return 1
            else:
                return -1
        except:
            return 0

    """#### ** Submitting_to_email **
    The presence of “mailto:” in the page is treated as an indicator of phishing websites.
    This feature checks for a “mailto:” target in the links and form actions of the page.
    If the page has a “mailto:” target, the value assigned to this feature is 1 (phishing) or
    else -1 (legitimate). If the page could not be fetched, the value is 0 (suspicious).
    """

    def Submitting_to_email(self):
        try:
            # find('mailto:') would look for a tag named "mailto:", which never
            # exists, so match on the href and form action attributes instead.
            if self.soup.find(href=re.compile(r'^mailto:', re.I)) or \
                    self.soup.find('form', action=re.compile(r'^mailto:', re.I)):
                return 1
            else:
                return -1
        except:
            return 0

    """#### ** Abnormal_URL **
    The presence of “<script>” HTML tag in the URL is a strong indicator of phishing
    websites. This feature checks for the presence of “<script>” tag in the URL.
    If the URL has “<script>” tag, the value assigned to this feature is 1 (phishing) or
    else -1 (legitimate).
    """

    """#### ** Redirect **
    The presence of a “<meta>” refresh tag in the page is treated as an indicator of
    phishing websites. This feature checks the page for a “<meta>” tag whose http-equiv
    attribute is "refresh".
    If the page has such a tag, the value assigned to this feature is 1 (phishing) or
    else -1 (legitimate).
    """

    def Redirect(self):
        try:
            if self.soup.find('meta', attrs={'http-equiv': 'refresh'}):
                return 1
            else:
                return -1
        except:
            return -1

    """### Status Bar Customization

    Phishers may use JavaScript to show a fake URL in the status bar to users. To
    extract this feature, we must dig out the webpage source code, particularly the
    “onMouseOver” event, and check if it makes any changes on the status bar.
    If onmouseover is found, the value assigned to this feature is 1 (phishing) or
    else -1 (legitimate). If the page could not be fetched, the value is also -1.
    """

    def on_mouseover(self):
        try:
            if re.findall(r"onmouseover", self.soup.text):
                return 1
            else:
                return -1
        except:
            return -1

    """### Disabling Right Click

    Phishers use JavaScript to disable the right-click function, so that users cannot
    view and save the webpage source code. This feature is treated exactly as “Using
    onMouseOver to hide the Link”. Nonetheless, for this feature, we will search for the
    “contextmenu” event or “event.button==2” in the webpage source code and check if
    the right click is disabled.
    If either pattern is found, the value assigned to this feature is 1 (phishing) or
    else -1 (legitimate). If the page could not be fetched, the value is also -1.
    """

    def RightClick(self):
        try:
            if re.findall(r"contextmenu|event.button ?== ?2", self.soup.text):
                return 1
            else:
                return -1
        except:
            return -1

    """### PopUp Window

    Phishers may use JavaScript to open a fake webpage in a new window to trick
    users. This feature is treated exactly as “Using onMouseOver to hide the Link”.
    Nonetheless, for this feature, we will search for “alert(”, “onMouseOver” or
    “window.open” in the webpage source code and check if a pop-up window is opened.
    If any of these patterns is found, the value assigned to this feature is 1
    (phishing) or else -1 (legitimate). If the page could not be fetched, the value is
    also -1.
    """

    def popUpWidnow(self):
        try:
            if re.findall(r"alert\(|onMouseOver|window.open", self.soup.text):
                return 1
            else:
                return -1
        except:
            return -1

    """### IFrame Redirection

    IFrame is an HTML tag used to display an additional webpage into one that is
    currently shown. Phishers can make use of the “iframe” tag and make it invisible, i.e.
    without frame borders. In this regard, phishers make use of the “frameBorder”
    attribute, which controls whether the browser renders a visual delineation.
    If an iframe tag or a frameBorder attribute is found in the page, the value assigned
    to this feature is 1 (phishing) or else -1 (legitimate). If the page could not be
    fetched, the value is also -1.
    """

    def Iframe(self):
        try:
            # The bracketed pattern "[<iframe>|<frameBorder>]" is a character class
            # that matches almost any text, so search the markup for the names instead.
            if re.findall(r"<iframe|frameborder", str(self.soup), re.IGNORECASE):
                return 1
            else:
                return -1
        except:
            return -1

    """#### Age of Domain

    This feature can be extracted from the WHOIS database. Most phishing websites live
    for a short period of time. The minimum age of the legitimate domain is considered to
    be 12 months for this project. Age here is the difference between the creation time
    and the current time.
    If age of domain > 12 months, the value of this feature is -1 (legitimate) else 1
    (phishing). If no WHOIS record is available, the value is also 1.
    """

    def age_of_domain(self):
        if self.whois is None:
            return 1
        try:
            if type(self.whois['creation_date']) is list:
                creation_date = self.whois['creation_date'][0]
            else:
                creation_date = self.whois['creation_date']

            ageofdomain = abs((datetime.datetime.now() - creation_date).days)

            if ageofdomain / 30 > 12:
                return -1
            else:
                return 1
        except:
            return 1

    """#### DNS Record

    For phishing websites, either the claimed identity is not recognized by the WHOIS
    database or no records are found for the hostname.
    If the DNS record is empty or not found, the value assigned to this feature is
    1 (phishing) or else -1 (legitimate).
    """

    def DNSRecord(self):
        try:
            resolver.resolve(self.domain, 'A')
            return -1
        except:
            return 1

    """#### Web Traffic

    This feature measures the popularity of the website by determining the number of
    visitors and the number of pages they visit. However, since phishing websites live for
    a short period of time, they may not be recognized by the Alexa database (Alexa the
    Web Information Company., 1996). By reviewing our dataset, we find that in worst
    scenarios, legitimate websites ranked among the top 100,000. Furthermore, if the
    domain has no traffic or is not recognized by the Alexa database, it is classified as
    “Phishing”.
    If the rank of the domain < 100000, the value of this feature is -1 (legitimate) else
    1 (phishing).
    """

    def web_traffic(self):
        try:
            alexadata = BeautifulSoup(requests.get(
                "http://data.alexa.com/data?cli=10&dat=s&url=" + self.domain,
                timeout=10).content, 'lxml')
            rank = int(alexadata.find('reach')['rank'])
            if rank < 100000:
                return -1
            else:
                return 1
        except:
            return 1
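
Several of the features in Features.py depend only on the URL string, so they can be tried without WHOIS, DNS or HTTP access. The sketch below restates a few of them as standalone functions with the same thresholds as the listing. The lowercase function names and the `lexical_features` helper are hypothetical, and using `hostname` rather than `netloc` (so that a trailing `:port` does not hide an IP address) is a deliberate variation on the listing, not the project's own behaviour.

```python
import ipaddress
from urllib.parse import urlparse


def having_ip_address(url):
    # hostname drops any ":port" suffix, so "1.2.3.4:8080" is still seen as an IP.
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return -1


def url_length(url):
    # < 54 legitimate, 54..75 suspicious, > 75 phishing
    if len(url) < 54:
        return -1
    if len(url) <= 75:
        return 0
    return 1


def having_at_symbol(url):
    return 1 if "@" in url else -1


def prefix_suffix(url):
    host = urlparse(url).hostname or ""
    return 1 if "-" in host else -1


def having_sub_domain(url):
    # Thresholds on the number of dots in the host, as in the listing.
    dots = (urlparse(url).hostname or "").count(".")
    if dots <= 2:
        return -1
    if dots == 3:
        return 0
    return 1


def url_depth(url):
    # Number of non-empty path segments.
    return sum(1 for part in urlparse(url).path.split("/") if part)


def lexical_features(url):
    """Collect the URL-only features under the names used by getFeaturesDict."""
    return {
        "having_IP_Address": having_ip_address(url),
        "URL_Length": url_length(url),
        "having_At_Symbol": having_at_symbol(url),
        "Prefix_Suffix": prefix_suffix(url),
        "having_Sub_Domain": having_sub_domain(url),
        "URL_Depth": url_depth(url),
    }
```

For example, `lexical_features("http://secure-example.com/a/b")` flags only the hyphen in the domain, while the remaining features stay at -1 and the depth is 2.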

Helper.py
from features import FeatureExtraction
from flask import send_from_directory
from html2image import Html2Image
from urllib.parse import urlparse
import pandas as pd
import validators
import pickle
import re
import os

screenshot_dir = 'screenshot/'
stats_params = ('visits', 'checked', 'phished')

h2i = Html2Image()
h2i.output_path = screenshot_dir

stats_filename = 'stats.txt'
model = pickle.load(open("model.pkl", "rb"))

def format_url(url):
    url = url.strip()
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url


def capture_screenshot(target_url, filename='screenshot.png', size=(1920, 1080)):
    h2i.screenshot(url=target_url, save_as=filename, size=size)
    return send_from_directory(screenshot_dir, path=filename)


def get_phishing_result(target_url):
    target_url = format_url(target_url)
    if not (target_url and validators.url(target_url)):
        return dict(status=False, message="You have provided an invalid target url, Please try again after updating the url.")
    try:
        update_stats('checked')
        target = urlparse(target_url)

        features_obj = FeatureExtraction(target_url)
        x = pd.DataFrame.from_dict(features_obj.getFeaturesDict(), orient='index').T

        pred = model.predict(x)[0]  # 1 is phished & 0 is not
        pred_prob = model.predict_proba(x)[0]
        safe_prob = pred_prob[0]
        unsafe_prob = pred_prob[1]

        if pred == 1:
            update_stats('phished')
        return dict(
            status=True,
            domain=target.netloc,
            target=target_url,
            safe_percentage=safe_prob*100,
            unsafe_percentage=unsafe_prob*100
        )
    except Exception as e:
        return dict(status=False, message=str(e))

def get_stats(key=None):
    stats = {}
    if os.path.exists(stats_filename):
        with open(stats_filename, "r") as file:
            for line in file:
                (k, v) = line.split(":")
                stats[k] = int(v)

        if key is not None:
            return stats[key] if key in stats else None

        return stats

    return False

def update_stats(key):
    stats = get_stats()
    with open("stats.txt", "w+") as file:
        if stats is False:
            file.write('\n'.join([f"{x}:0" for x in stats_params]))
        else:
            lines = []
            avail_params = list(stats_params)
            for k, v in stats.items():
                avail_params.remove(k)
                if k == key:
                    v += 1
                lines.append(f"{k}:{v}")
            if len(avail_params) > 0:
                for param in avail_params:
                    lines.append(f"{param}:0")
            file.write("\n".join(lines))
        file.flush()

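
The counters in Helper.py are persisted as plain `key:value` lines in stats.txt. The sketch below shows the same read, increment and rewrite cycle against a temporary file so that it can be run in isolation. The names `read_stats` and `bump_stat` and the explicit `path` parameter are hypothetical; unlike `update_stats`, this variant also counts the very first call, when the file does not exist yet.

```python
import os
import tempfile

STATS_PARAMS = ("visits", "checked", "phished")


def read_stats(path):
    """Parse "key:value" lines into a dict; a missing file means all zeros."""
    stats = {name: 0 for name in STATS_PARAMS}
    if os.path.exists(path):
        with open(path, "r") as handle:
            for line in handle:
                if ":" in line:
                    key, value = line.strip().split(":", 1)
                    stats[key] = int(value)
    return stats


def bump_stat(path, key):
    """Increment one counter and rewrite the whole file."""
    stats = read_stats(path)
    stats[key] = stats.get(key, 0) + 1
    with open(path, "w") as handle:
        handle.write("\n".join(f"{k}:{v}" for k, v in stats.items()))
    return stats


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stats.txt")
    bump_stat(demo_path, "visits")
    bump_stat(demo_path, "visits")
    demo_stats = bump_stat(demo_path, "phished")
```

Rewriting the whole file on every update is simple, but concurrent requests can overwrite each other's increments; a lock or a small database would be one way to address that if it matters for the deployment.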
Main.py
from flask import Flask
from url_detector import *
from content_detector import *

app = Flask(__name__)

# Routes from app.py

@app.route('/')
def home():
    return render_template('home.html')


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        message = request.form['message']
        data = [message]
        vect = cv.transform(data).toarray()
        my_prediction = classifier.predict(vect)
        return render_template('result.html', prediction=my_prediction)

# Routes from main.py

@app.route('/url-detector')
def url():
    update_stats('visits')
    return render_template("index.html")


@app.route('/check', methods=['GET', 'POST'])
def check():
    update_stats('visits')
    if request.method == "POST":
        target_url = request.json['target']
        result = get_phishing_result(target_url=target_url)
        return jsonify(result)

    target_url = request.args.get("target")
    return render_template('check.html', target=target_url)


@app.route("/listen")
def listen():
    def respond_to_client():
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'], "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)

    return Response(respond_to_client(), mimetype='text/event-stream')


@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()

        width = default_screenshot_width
        height = default_screenshot_height

        if query.get("width") and query.get("height"):
            width = int(query.get("width"))
            height = int(query.get("height"))

        ss_file_name = secure_filename(f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)

        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)

        return capture_screenshot(target_url=target_url, filename=ss_file_name, size=(width, height))
    abort(404)


if __name__ == '__main__':
    app.run(debug=True)

Url detector.py
from flask import Flask, Response, render_template, request, send_from_directory, jsonify, abort
from helper import get_phishing_result, get_stats, update_stats, capture_screenshot, screenshot_dir
from werkzeug.utils import secure_filename
from datetime import date
import time
import json
import os

app = Flask(__name__)
app.secret_key = os.urandom(12).hex()
default_screenshot_width = 1920
default_screenshot_height = 1080


@app.route('/')
def home():
    update_stats('visits')
    return render_template("index.html")


@app.route('/check', methods=['GET', 'POST'])
def check():
    update_stats('visits')
    if request.method == "POST":
        target_url = request.json['target']
        result = get_phishing_result(target_url=target_url)
        return jsonify(result)
    target_url = request.args.get("target")
    return render_template('check.html', target=target_url)


@app.route("/listen")
def listen():
    def respond_to_client():
        while True:
            stats = get_stats()
            _data = json.dumps(
                {"visits": stats['visits'], "checked": stats['checked'], "phished": stats['phished']})
            yield f"id: 1\ndata: {_data}\nevent: stats\n\n"
            time.sleep(0.5)

    return Response(respond_to_client(), mimetype='text/event-stream')


@app.route("/screenshot")
def screenshot():
    query = request.args
    if query and query.get("target"):
        target_url = query.get("target")
        today_date = date.today()
        width = default_screenshot_width
        height = default_screenshot_height

        if query.get("width") and query.get("height"):
            width = int(query.get("width"))
            height = int(query.get("height"))

        ss_file_name = secure_filename(f"{target_url}-{today_date}-{width}x{height}.png")
        ss_file_path = os.path.join(screenshot_dir, ss_file_name)

        if os.path.exists(ss_file_path):
            return send_from_directory(screenshot_dir, path=ss_file_name)

        return capture_screenshot(target_url=target_url, filename=ss_file_name, size=(width, height))
    abort(404)


if __name__ == '__main__':
    app.run(debug=True)
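
The `/listen` route streams the counters to the browser as Server-Sent Events: each message is a few `field: value` lines terminated by a blank line, and the `event: stats` field is what routes it to the `"stats"` listener registered on the `EventSource` in the index template. The sketch below isolates that message format from Flask so it can be inspected directly. The helper names `format_sse` and `stats_stream` are hypothetical and are not part of the project code.

```python
import json


def format_sse(payload, event=None, event_id=None):
    """Build one Server-Sent Events message from a JSON-serialisable payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(payload)}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


def stats_stream(snapshots):
    """Yield one 'stats' event per snapshot, as the /listen generator does per poll."""
    for stats in snapshots:
        yield format_sse(stats, event="stats", event_id=1)


demo_messages = list(stats_stream([{"visits": 3, "checked": 2, "phished": 1}]))
```

In the Flask route, a generator of this kind is wrapped in `Response(..., mimetype='text/event-stream')`; the sketch leaves out the `while True` loop and the `time.sleep(0.5)` polling interval so that it terminates.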

A2 – SCREENSHOTS

Figure A2.1 Home page of the website

Figure A2.2 URL Detector page of the website

This image also shows the number of website visits, the total number of websites
checked, and the total number of phishing websites detected out of the total
websites checked.

Figure A2.3 Valid phishing URL has been pasted

Figure A2.4 Valid phishing URL has been detected

Figure A2.5 Invalid phishing URL has been pasted

Figure A2.6 Invalid phishing URL has been detected
