Scalable Malicious URL Classification: Leveraging Lexical Analysis and API Integration

Sri Devi Sameera Mekhala, Kota Jasper Surya Chowdary, Jasti Nithin Ranganayakulu, Meghana Koppula, Chittabathina Venkata Sai Poojitha
Dept of CSE, Dhanekula Institute of Engineering and Technology, Vijayawada, India

Abstract—In the digital age, the proliferation of malicious Uniform Resource Locators (URLs) poses a significant cybersecurity threat, enabling phishing attacks, malware infections, and data breaches that jeopardize sensitive information and user safety. Traditional methods using static blacklists and heuristics are ineffective against evolving threats. This study proposes a machine learning-based approach to enhance malicious URL detection, utilizing nine lexical features for efficient phishing attack identification.
Leveraging the ISCXURL-2016 dataset, comprising 11,964 legitimate and phishing URLs, our model demonstrates effective detection capabilities suitable for resource-constrained devices. The classification model distinguishes between benign and malicious URLs by examining URL-derived features, metadata, and Application Programming Interface (API) integration.
This approach addresses the limitations of existing methods, which rely on numerous features requiring substantial processing power. The objective is to develop a dynamic, scalable system capable of identifying new, unseen malicious URLs with high accuracy.

Keywords—Malicious URL, Machine learning (ML), Random forest, Convolutional neural networks (CNN), k-nearest neighbors (KNN), Blacklisting, API.

I. INTRODUCTION

Phishing attacks are a constant threat to online security. These attacks target users to steal sensitive data through fraudulent websites. Detecting such websites requires scalable and effective solutions, especially since phishing tactics are continually evolving. This paper, titled "Scalable Malicious URL Classification: Leveraging Lexical Analysis and API Integration," presents a hybrid approach for phishing website detection using lexical features and API integration.

The approach integrates two methods of classifying URLs: a machine learning (ML) model that analyses lexical features from the website content, and an external service, Urlscan.io, that delivers detailed URL scanning results. The ML model uses supervised learning algorithms, for instance Gaussian Naive Bayes and Random Forest, to detect phishing attempts based on the structure and content of web pages. The Urlscan.io API, in turn, scans a URL against a large database of known phishing sites, which gives an additional layer of detection.

The system, implemented as a Streamlit web application, allows users to input a URL and choose between the ML model and the API for phishing detection. By combining these techniques, the paper shows how scalable and flexible solutions can improve the accuracy and efficiency of phishing detection, providing a robust tool for enhancing online security.

II. LITERATURE REVIEW

1. “Malicious URL Detection using Machine Learning: A Survey”, Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi, Salesforce Research Asia (2018). Discusses how URL-based attacks have risen to be among the top cyber attacks and examines the scope for effective machine learning algorithms to detect such URLs.

2. “Classification of Malicious URLs Using Machine Learning”, Shayan Abad, Hassan Gholamy, Mohammad Aslani (2023). Provides an in-depth analysis of the use of machine learning models and their impact on recall, precision, and F1 scores.

3. “Malicious URL Detection based on Machine Learning”, Do Cho, Hoa Dinh, Tisenko Victor (2023). Explains various techniques and fine adjustments of algorithm parameters that significantly affect the outcome, such as the number of trees used by tree-based models, and also shows how libraries such as SparkML can significantly reduce training and testing time.

4. “A Comparative Study of Malicious URL Detection: Regular Expression Analysis, Machine Learning, and VirusTotal API”, Jason Misquitta, Dr. Anusha K (2023). Details how integrating an API into the detection mechanism can strongly affect a model's overall performance.

III. PROPOSED MODEL

This phishing detection model uses two complementary techniques: classification with machine learning, and external URL scanning with the Urlscan.io API. The hybrid approach is intended to enhance the scalability and accuracy of phishing detection by combining machine learning with the real-time threat intelligence provided by the Urlscan.io service.

1. Machine Learning Model: The machine learning component of the proposed system classifies a website as phishing or legitimate, based on lexical and structural features extracted from the content of the webpage. The model processes URLs by extracting relevant features from the Hyper Text Markup Language (HTML) source code of the website, which are then used to train various supervised learning algorithms.

Feature Extraction: For detecting phishing websites, the system focuses on features such as:

Domain Name Features: The length of the domain name, the presence of suspicious words such as "login" and "secure", and similarity to popular legitimate sites.

URL Structure: Number of subdirectories, presence of HTTPS, and presence of unusual characters in the URL.

Content Features: HTML structure, presence of hidden iframes, suspicious forms, and misleading hyperlinks that indicate phishing sites.

The BeautifulSoup library is used to parse the HTML content of a given page, and a feature vector is created for each URL.

Supervised Learning Models

To classify the URLs, various supervised machine learning algorithms have been used, among which are:

Gaussian Naive Bayes: A probabilistic classifier that applies Bayes' theorem; it is useful when features are conditionally independent given the class.

SVM: A classifier that works well in high-dimensional spaces and effectively differentiates between phishing sites and legitimate sites.

Random Forest: An ensemble method that combines multiple decision trees to increase classification accuracy.

Decision Tree: A tree-like model that splits data on the basis of feature thresholds.
KNN (K-Nearest Neighbors): A non-parametric classification technique. It predicts the class of a URL from the classes of its closest neighbors in feature space.

AdaBoost: A boosting algorithm in which multiple weak learners are combined to create a strong learner.

Neural Network: Used for more complex tasks such as feature extraction or classification.

These models are trained on labelled data that includes both phishing and legitimate websites, taking the extracted features as input. Once trained, the models predict whether a given URL is phishing or legitimate from the features obtained from its content.

2. API Integration (Urlscan.io)

Along with the machine learning model, the proposed system incorporates the Urlscan.io API for external URL scanning. The Urlscan.io API analyzes the provided URL and returns comprehensive output about the reputation of a website, which may include previous phishing activity associated with the domain.

API Workflow: The Urlscan.io API is used in two stages:

URL Submission: Every time the user inputs a URL into the system, it is first submitted to the Urlscan.io API to be scanned. The service checks the URL against its global database and analyzes the website in detail for signs of malicious activity such as phishing attempts, suspicious redirects, or other kinds of cybercrime.

Result Retrieval: After submission, the system waits for the scan to finish and retrieves the results. The system uses the scan ID to access the detailed report containing verdicts and categorizations, whether phishing, malware, or any other malicious activity.

3. Hybrid Model for Increased Accuracy

The innovation in the presented model is that it makes use of the outputs of both the machine learning model and the Urlscan.io API. Users can choose either the ML model or the Urlscan.io API to detect phishing. The system may further be extended to combine both results for better decision-making.

Decision Fusion: Both methods may be combined by applying a decision fusion strategy:

Independent Prediction: Each method makes an independent prediction. A final prediction can then be obtained through voting, for example by majority vote or through a weighted combination of the models according to their accuracy.

Confidence Scoring: The system could assign a confidence score to each method if both methods provide a prediction. For instance, if the machine learning model and the Urlscan.io API both have strong evidence of phishing, the system may output a higher confidence in labeling the URL as phishing.

Fallback Mechanism: If one method (say, the machine learning model) cannot make a prediction due to missing features or a low confidence score, the system can fall back on the Urlscan.io API for an external check.

4. User Interface and Experience: The Streamlit web application is the front-end interface through which users interact with the system. It allows users to:

Input URLs: The user can input any URL for scanning.

Select Detection Method: Phish Detector lets the user choose whether the input is checked by the machine learning model or by the Urlscan.io API.

View Results: Once the analysis is done, the application displays whether the URL is legitimate or phishing, along with additional information such as the scan ID for Urlscan.io and the prediction confidence for the machine learning model.

5. Scalability and Real-Time Detection

The system is designed to scale to a large number of requests in real time, using the external API alongside the machine learning model to scan newly emerging threats. It is intended to be updated as new malicious URLs are introduced, so that it can identify both known and previously unseen phishing websites.
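As a minimal sketch of the lexical part of the Feature Extraction step in Section III.1, the following Python function derives a small feature vector from the URL string alone. The function name lexical_features, the dictionary keys, and the UNUSUAL_CHARS set are illustrative assumptions. The word list contains only the two examples named in the text, and the sketch does not attempt to reproduce the exact feature set used in this study.

```python
from urllib.parse import urlparse

# Only the two example words named in the text; a real list would be longer.
SUSPICIOUS_WORDS = ("login", "secure")
# Hypothetical set of characters treated as "unusual" for this sketch.
UNUSUAL_CHARS = set("@%!$*~^|")


def lexical_features(url: str) -> dict:
    """Return a small dictionary of lexical features for one URL."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    # Non-empty path segments, e.g. /a/b/c.php -> ["a", "b", "c.php"].
    segments = [s for s in parsed.path.split("/") if s]
    lowered = url.lower()
    return {
        "domain_length": len(host),
        "suspicious_words": sum(lowered.count(w) for w in SUSPICIOUS_WORDS),
        "num_path_segments": len(segments),
        "uses_https": int(parsed.scheme == "https"),
        "unusual_chars": sum(1 for ch in url if ch in UNUSUAL_CHARS),
    }


if __name__ == "__main__":
    print(lexical_features("http://secure-login.example.com/account/verify/index.php?user=a@b"))
```

The values of the returned dictionary can be arranged as one row of the feature matrix passed to the supervised models. Content features such as hidden iframes would require fetching and parsing the page and are not covered here.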

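The two-stage API workflow of Section III.2 (submission, then result retrieval by scan ID) could be sketched as follows. The endpoint paths, the API-Key header, the "uuid" field, and the verdicts.overall structure follow the public Urlscan.io API as we understand it and should be checked against the current documentation. The helper names http_json and scan_url, the polling interval, and the number of polls are assumptions made for this sketch, not the implementation used in the system.

```python
import json
import time
import urllib.error
import urllib.request

API_BASE = "https://urlscan.io/api/v1"


def http_json(method, url, api_key, payload=None):
    """Send one request and return (HTTP status, parsed JSON body or None)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"API-Key": api_key, "Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as err:
        # A 404 on the result endpoint means the scan has not finished yet.
        return err.code, None


def scan_url(url, api_key, send=http_json, sleep=time.sleep,
             poll_interval=5.0, max_polls=12):
    """Submit a URL, wait for the scan, and return a short verdict summary."""
    # Stage 1: submission. The response carries the scan ID ("uuid").
    status, body = send("POST", f"{API_BASE}/scan/", api_key, {"url": url})
    if status != 200 or not body:
        raise RuntimeError(f"submission failed with HTTP status {status}")
    scan_id = body["uuid"]

    # Stage 2: result retrieval, polling until the report exists.
    for _ in range(max_polls):
        sleep(poll_interval)
        status, report = send("GET", f"{API_BASE}/result/{scan_id}/", api_key)
        if status == 200 and report is not None:
            overall = report.get("verdicts", {}).get("overall", {})
            return {
                "scan_id": scan_id,
                "malicious": bool(overall.get("malicious", False)),
                "categories": overall.get("categories", []),
            }
        if status != 404:
            raise RuntimeError(f"result lookup failed with HTTP status {status}")
    raise TimeoutError(f"scan {scan_id} did not finish in time")
```

The send and sleep parameters exist so that the workflow can be exercised without network access, for example by passing a stub in place of http_json. In the application, the returned scan ID and verdict would be shown in the results view.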
6. Areas of Future Developments: While the proposed model could be a strong solution for phishing detection, future work might address several other aspects:

Model performance optimization: The system will benefit from continuous training on new data to improve the performance of the machine learning model.

Real-time feedback: Incorporating continuous learning models to build a real-time feedback loop as new phishing URLs keep emerging.

Extended API Integration: Other URL scanning services and threat intelligence APIs might be integrated to improve the system's detection.
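The decision fusion strategy of Section III.3 (weighted combination, confidence scoring, and fallback) is described as a possible extension. One way it could look is sketched below. The function name fuse_predictions, the weights, and the min_ml_confidence threshold are hypothetical values chosen for illustration and are not results or settings from this study.

```python
from typing import Optional, Tuple


def fuse_predictions(
    ml_phishing_prob: Optional[float],
    api_malicious: Optional[bool],
    ml_weight: float = 0.6,
    api_weight: float = 0.4,
    min_ml_confidence: float = 0.6,
) -> Tuple[str, float]:
    """Combine the ML probability and the API verdict into (label, confidence)."""
    # The ML output is usable only if present and confident enough either way.
    ml_usable = (
        ml_phishing_prob is not None
        and max(ml_phishing_prob, 1.0 - ml_phishing_prob) >= min_ml_confidence
    )
    api_usable = api_malicious is not None

    if not ml_usable and not api_usable:
        return "unknown", 0.0
    if not ml_usable:
        # Fallback mechanism: rely on the external scan alone.
        score = 1.0 if api_malicious else 0.0
    elif not api_usable:
        score = ml_phishing_prob
    else:
        # Weighted combination of the two independent predictions.
        api_score = 1.0 if api_malicious else 0.0
        score = (ml_weight * ml_phishing_prob + api_weight * api_score) / (
            ml_weight + api_weight
        )

    label = "phishing" if score >= 0.5 else "legitimate"
    confidence = score if label == "phishing" else 1.0 - score
    return label, confidence


if __name__ == "__main__":
    print(fuse_predictions(0.9, True))
    print(fuse_predictions(0.55, False))
```

When both sources agree, the fused confidence is higher than the ML probability alone, which matches the confidence scoring idea. When the ML probability is close to 0.5, the API verdict decides. Additional scanning services could be folded in the same way by giving each its own weight.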

IV. IMPLEMENTATION

The proposed phishing detection system integrates machine learning and the Urlscan.io API to detect phishing attacks on external websites. It extracts lexical features, such as the structure of the domain, HTTPS usage, and HTML content, from the website using BeautifulSoup, and then trains several models, for example Naive Bayes, SVM, and Random Forest, to classify the URL. The application allows users to enter a URL, select the detection method (the ML model or Urlscan.io), and view the results. The Urlscan.io API scans sites for phishing activity, while the machine learning model classifies them from the extracted features. The system, built with Streamlit, is designed to be scalable and to provide real-time analysis with secure and efficient processing.

V. RESULTS

Fig.1

Fig.2

Fig.3

VI. CONCLUSION

The proposed hybrid phishing detection model combines machine learning classification with real-time URL scanning through the Urlscan.io API and provides a scalable, efficient, and accurate solution for identifying phishing websites. By leveraging lexical features and external threat intelligence, the model can detect a wide variety of phishing tactics, making it a powerful tool to enhance online security.

VII. ACKNOWLEDGEMENT

Finally, we express our gratitude for the contributions of the broader research community and the open-source resources that have facilitated this work. This research would not have been possible without the support and encouragement of our peers.
VIII. REFERENCES

1. “Dynamic Malware Classification and API Categorisation of Windows Portable Executable Files Using Machine Learning”, Durre Zehra Syeda and Mamoona Naveed Asghar, 25 January 2024.

2. “Detection of malicious URLs using machine learning”, Nuria Reyes-Dorta, Pino Caballero-Gil and Carlos Rosa-Remedios, 06 March 2024.

3. “An Identification and Analysis of Harmful URLs through the Application of Machine Learning Techniques”, Swagat M. Gavali, Shital Kakad, Swapnaja Amol, Ashwini B. Gavali, Sonali B. Gavali, 23 February 2024.

4. “Malicious URL Detection and Classification Analysis using Machine Learning Models”, Upendra Shetty, Anusha Patil, Mohana, January 2023.

5. “A text classification approach to API type resolution for incomplete code snippets”, Camilo Velázquez-Rodríguez, Dario Di Nucci, Coen De Roover, April 2023.

6. “URL Classification Based on Lexical Features by Machine Learning”, Cing Gel Vung, Yu Yu Win, 19 July 2023.

7. “Classification of Malicious URLs Using Machine Learning”, Shayan Abad, Hassan Gholamy, Mohammad Aslani, University of Gävle, September 2023.

8. “Malicious Software Detection based on URL-API Intensity Feature Selection Using Deep Spectral Neural Classification for Improving Host Security”, B. Lavanya and C. Shanthi, 2023.

9. “Phishing Website Detection Using Novel Machine Learning Fusion Approach”, Arikatla Gopi Venkata Sudheer, Aravapalli Sujith Kumar, March 2022.

10. “Detecting Malicious URLs Using Machine Learning Techniques”, Alotaibijabri, Hanan S. Altamimi, Shahd A. Albelali, Maimunah Al-Harbi, Haya T. Alhuraib, Najd K. Alotaibi, 14 November 2022.
