Scalable Malicious URL Classification: Leveraging Lexical Analysis and API Integration

Sri Devi Sameera Mekhala Kota Jasper Surya Chowdary Jasti Nithin Ranganayakulu
Dept of CSE Dept of CSE Dept of CSE
Dhanekula Institute of Dhanekula Institute of Dhanekula Institute of
Engineering and Technology Engineering and Technology Engineering and Technology
Vijayawada, India Vijayawada, India Vijayawada, India

Meghana Koppula Chittabathina Venkata Sai Poojitha


Dept of CSE Dept of CSE
Dhanekula Institute of Engineering and Technology Dhanekula Institute of Engineering and Technology
Vijayawada, India Vijayawada, India

Abstract—In the digital age, the proliferation of malicious Uniform Resource Locators (URLs) poses a significant cybersecurity threat, enabling phishing attacks, malware infections, and data breaches that jeopardize sensitive information and user safety. Traditional methods using static blacklists and heuristics are ineffective against evolving threats. This study proposes a machine learning-based approach to enhance malicious URL detection, utilizing nine lexical features for efficient phishing attack identification.
Leveraging the ISCXURL-2016 dataset, comprising 11,964 legitimate and phishing URLs, our model demonstrates effective detection capabilities suitable for resource-constrained devices. The classification model accurately distinguishes between benign and malicious URLs by examining URL-derived features, metadata, and Application Programming Interface (API) integration.
This approach addresses the limitations of existing methods, which rely on numerous features requiring substantial processing power. The objective is to develop a dynamic, scalable system capable of identifying new, unseen malicious URLs with high accuracy.

Keywords—Malicious URL, Machine learning (ML), Random forest, Convolutional neural networks (CNN), k-nearest neighbors (KNN), Blacklisting, API.

I. INTRODUCTION

Phishing attacks are a constant threat to online security. These attacks target users to steal sensitive data through fraudulent websites. Detecting such websites requires scalable and effective solutions, especially since phishing tactics are continually evolving. This paper, titled "Scalable Malicious URL Classification: Leveraging Lexical Analysis and API Integration," presents a hybrid approach for phishing website detection using lexical features and API integration.

The approach integrates two methods of classifying URLs: a machine learning (ML) model that analyses lexical features from the website content, and an external service, Urlscan.io, that delivers detailed URL scanning results. The ML model uses supervised learning algorithms, for instance Gaussian Naive Bayes and Random Forest, to identify phishing attempts based on the structure and content of web pages. The Urlscan.io API, in turn, scans a URL against a large database of known phishing sites, which gives an additional layer of detection.

This system, implemented in a Streamlit web application, allows users to input a URL and choose between the ML model and the API for phishing detection. By combining these techniques, the paper shows how scalable and flexible solutions can improve the accuracy and efficiency of phishing detection, providing a robust tool for enhancing online security.

II. LITERATURE REVIEW

1. "Malicious URL Detection using Machine Learning: A Survey", Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi, Salesforce Research Asia (2018). Discusses how URL-based attacks have climbed to the top of cyber attacks, and the scope for effective machine learning algorithms to detect such URLs.

2. "Classification of Malicious URLs Using Machine Learning", Shayan Abad, Hassan Gholamy, Mohammad Aslani (2023). Provides an in-depth analysis of the use of machine learning models and their impact on recall, precision, and F1 scores.

3. "Malicious URL Detection based on Machine Learning", Do Cho, Hoa Dinh, Tisenko Victor (2023). Explains various techniques and fine adjustments to algorithm parameters that significantly affect the outcome, such as the number of trees used in a tree-based model, and how libraries such as SparkML can significantly improve training and testing time.

4. "A Comparative Study of Malicious URL Detection: Regular Expression Analysis, Machine Learning, and VirusTotal API", Jason Misquitta, Dr. Anusha K (2023). Details how integrating an API into the detection mechanism can strongly affect a model's performance.

III. PROPOSED MODEL

This phishing detection model uses two complementary techniques: classification with machine learning, and external URL scanning with the Urlscan.io API. The hybrid approach aims to improve the scalability and accuracy of phishing detection by combining machine learning with the real-time threat intelligence provided by the Urlscan.io service.

1. Machine Learning Model: The machine learning component of the proposed system classifies a website as phishing or legitimate, based on lexical and structural features extracted from the content of the webpage. The model processes URLs by extracting relevant features from the HyperText Markup Language (HTML) source code of the website, which are then used to train various supervised learning algorithms.

Feature Extraction: For detecting phishing websites, the system focuses on lexical features such as:

Domain Name Features: The length of the domain name, the presence of suspicious words such as "login" and "secure," and similarity to popular legitimate sites.

URL Structure: Number of subdirectories, presence of HTTPS, and presence of unusual characters in the URL.

Content Features: HTML structure, presence of hidden iframes, suspicious forms, and misleading hyperlinks that indicate phishing sites.

The HTML content of a given page is parsed with the BeautifulSoup library to extract these features, and a feature vector is created for each URL.

Supervised Learning Models

To classify the URLs, various supervised machine learning algorithms have been used, among which are:

Gaussian Naive Bayes: A probabilistic classifier that applies Bayes' theorem; useful when features are conditionally independent.

SVM: A classifier that works well in high-dimensional spaces and effectively differentiates between phishing sites and legitimate sites.

Random Forest: An ensemble method that combines multiple decision trees to increase classification accuracy.

Decision Tree: A tree-like model that splits data on the basis of feature thresholds.
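The URL-level features listed under Feature Extraction can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names, the list of suspicious words beyond "login" and "secure", and the set of "unusual" characters are hypothetical choices, and the five features shown are not the paper's nine-feature set, which the text does not enumerate. Similarity to legitimate sites and the HTML content features are left out.

```python
from urllib.parse import urlparse

# Assumption: the paper names only "login" and "secure" as suspicious words.
SUSPICIOUS_WORDS = ("login", "secure")
# Assumption: an illustrative set of "unusual" characters, not the authors' list.
UNUSUAL_CHARS = set("@%$!~")

# Fixed ordering so every URL maps to a vector with the same layout.
FEATURE_ORDER = (
    "domain_length",
    "has_suspicious_word",
    "num_subdirectories",
    "uses_https",
    "num_unusual_chars",
)


def lexical_features(url: str) -> dict:
    """Compute a few lexical features from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Non-empty path segments stand in for the number of subdirectories.
    segments = [part for part in parsed.path.split("/") if part]
    lowered = url.lower()
    return {
        "domain_length": len(host),
        "has_suspicious_word": int(any(word in lowered for word in SUSPICIOUS_WORDS)),
        "num_subdirectories": len(segments),
        "uses_https": int(parsed.scheme == "https"),
        "num_unusual_chars": sum(1 for char in url if char in UNUSUAL_CHARS),
    }


def to_vector(features: dict) -> list:
    """Flatten the feature dictionary into a fixed-order numeric vector."""
    return [features[name] for name in FEATURE_ORDER]


if __name__ == "__main__":
    sample = "http://secure-login.example.com/a/b/index.php?id=1%20"
    print(to_vector(lexical_features(sample)))
```

A vector of this kind is what a supervised classifier would consume as one training row; because it needs no page download, it is cheap enough for the resource-constrained setting the abstract mentions.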

KNN (K-Nearest Neighbors): A non-parametric classification technique. The algorithm predicts the class of a URL based on its closest neighbors in feature space.

AdaBoost: A boosting algorithm in which multiple weak learners are combined to create a strong learner.

Neural Network: Used for complex tasks such as feature extraction or classification.

These models are trained on labelled data that includes both phishing and legitimate websites, taking the extracted features as input. After training, the models can predict whether a given URL is phishing or legitimate from the features obtained from its content.

2. API Integration (Urlscan.io)

Along with the machine learning model, the proposed system incorporates the Urlscan.io API for external URL scanning. The Urlscan.io API analyzes the provided URL and gives comprehensive output about the reputation of the website, which may include previous phishing activity associated with the domain.

API Workflow: The Urlscan.io API is used in two stages:

URL Submission: Whenever the user enters a URL into the system, it is first submitted to the Urlscan.io API to be scanned. The service checks the URL against its global database and analyzes the website in detail for signs of malicious activity such as phishing attempts, suspicious redirects, or other kinds of cybercrime.

Result Retrieval: After submission, the system waits for the scan to finish and retrieves the results. The system uses the scan ID to access the detailed report containing verdicts and categorizations, such as phishing, malware, or other malicious activity.

3. Hybrid Model for Increased Accuracy

The innovation in the model presented is that it makes use of the outputs of both the machine learning model and the Urlscan.io API. Users can choose either the ML model or the Urlscan.io API to detect phishing. The system may further be extended to combine both results for better decision-making.

Decision Fusion: The two methods may be combined by applying a decision fusion strategy:

Independent Prediction: Each method makes an independent prediction. A final prediction can then be obtained through voting, for example by majority vote or through a weighted combination of the models according to their accuracy.

Confidence Scoring: The system could assign a confidence score to each method if both methods provide a prediction. For instance, if the machine learning model and the Urlscan.io API both have strong evidence of phishing, the system may output a higher confidence in labeling the URL as phishing.

Fallback Mechanism: If one method (say, the machine learning model) cannot make a prediction due to missing features or a low confidence score, the system can fall back on the Urlscan.io API for an external check.

4. User Interface and Experience: The Streamlit web application is the front-end interface through which users interact with the system. It allows users to:

Input URLs: The user can input any URL for scanning.

Select Detection Method: Phish Detector lets the user choose whether the input is checked by the machine learning model or by the Urlscan.io API.

View Results: Once the analysis is done, the application displays whether the URL is legitimate or phishing, along with further information such as the scan ID for Urlscan.io and the prediction confidence for the machine learning model.

5. Scalability and Real-Time Detection

The system is designed to scale to a large number of requests in real time, using the external API to scan for new threats alongside the machine learning model. It is intended to be updated in real time as new malicious URLs appear, so that it can accurately identify both known and previously unseen phishing websites.
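The two API stages described under API Integration (URL submission, then result retrieval by scan ID) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the endpoint paths, the "API-Key" header, the "visibility" field, and the "verdicts.overall.malicious" report field reflect our understanding of the public Urlscan.io API and should be checked against its documentation. The helper names, the polling interval, and the injectable "http" parameter (used so the logic can be exercised without network access) are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request

# Assumption: endpoint paths as we understand the public Urlscan.io API.
SCAN_ENDPOINT = "https://urlscan.io/api/v1/scan/"
RESULT_ENDPOINT = "https://urlscan.io/api/v1/result/{uuid}/"


def http_json(method, url, headers=None, payload=None):
    """Send one HTTP request and return (status_code, parsed JSON or None)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(url, data=data, headers=headers or {}, method=method)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Error statuses (for example, report not ready yet) carry no usable body here.
        return err.code, None


def submit_url(target_url, api_key, http=http_json):
    """Stage 1: submit the URL for scanning and return the scan ID."""
    headers = {"API-Key": api_key, "Content-Type": "application/json"}
    payload = {"url": target_url, "visibility": "public"}  # visibility value is an assumption
    status, body = http("POST", SCAN_ENDPOINT, headers, payload)
    if status != 200 or not body or "uuid" not in body:
        raise RuntimeError(f"submission failed with HTTP status {status}")
    return body["uuid"]


def fetch_result(scan_id, http=http_json, attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Stage 2: poll the result endpoint until the report is available."""
    for _ in range(attempts):
        status, body = http("GET", RESULT_ENDPOINT.format(uuid=scan_id))
        if status == 200 and body is not None:
            return body
        sleep(delay_seconds)  # the scan has not finished yet; wait and retry
    raise TimeoutError(f"scan {scan_id} did not finish in time")


def is_malicious(report):
    """Read the overall verdict; a missing field is treated as not malicious."""
    return bool(report.get("verdicts", {}).get("overall", {}).get("malicious", False))
```

In use, the application would call submit_url with the user's URL and key, pass the returned scan ID to fetch_result, and show the scan ID together with the verdict, as described under View Results.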

6. Areas of Future Developments: While the proposed model could be a strong solution for phishing detection, future work might address several other aspects:

Model performance optimization: The system will benefit from continuous training on new data to improve machine learning model performance.

Real-time feedback: Incorporating continuous learning models to improve the real-time feedback loop as new phishing URLs keep emerging.

Extended API Integration: Other URL scanning services and threat intelligence APIs might be integrated to increase the system's detection coverage.
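The decision-fusion extension described under the Hybrid Model (independent predictions, weighted voting, confidence scoring, and a fallback) can be sketched as below. The paper presents fusion only as a possible extension, so everything here is an assumption for illustration: the Prediction type, the fuse function, the equal default weights, and the 0.6 confidence threshold are hypothetical. In practice the weights would be set from each method's measured accuracy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    is_phishing: bool
    confidence: float  # expected to lie in [0, 1]


def _signed(prediction: Prediction) -> float:
    """Map a prediction to [-1, 1]: positive means phishing, negative means legitimate."""
    return prediction.confidence if prediction.is_phishing else -prediction.confidence


def fuse(
    ml: Optional[Prediction],
    api: Optional[Prediction],
    ml_weight: float = 0.5,
    api_weight: float = 0.5,
    min_ml_confidence: float = 0.6,
) -> Optional[Prediction]:
    """Combine the ML and API predictions with a weighted vote and a fallback."""
    # Fallback: ignore an ML prediction that is missing or too uncertain.
    if ml is not None and ml.confidence < min_ml_confidence:
        ml = None
    if ml is None and api is None:
        return None  # neither method produced a usable prediction
    if ml is None:
        return api
    if api is None:
        return ml
    # Weighted vote: agreement yields a large magnitude, disagreement a small one.
    score = (ml_weight * _signed(ml) + api_weight * _signed(api)) / (ml_weight + api_weight)
    return Prediction(is_phishing=score > 0, confidence=abs(score))


if __name__ == "__main__":
    print(fuse(Prediction(True, 0.9), Prediction(True, 0.8)))
    print(fuse(Prediction(True, 0.3), Prediction(False, 0.9)))
```

With this scheme, two confident and agreeing methods produce a high fused confidence, while conflicting methods produce a low one that the interface could flag for manual review.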

IV. IMPLEMENTATION

The proposed phishing detection system integrates machine learning and the Urlscan.io API to detect phishing attacks on external websites. It extracts lexical features, such as the structure of the domain, HTTPS usage, and HTML content, from the website using BeautifulSoup, and then trains several models, for example Naive Bayes, SVM, and Random Forest, to classify the attack. The application allows users to enter a URL, select the detection method (ML or Urlscan.io), and view the results. The Urlscan.io API scans sites for phishing activity, while the machine learning model classifies them from the extracted features. The system, built with Streamlit, is scalable and supports real-time analysis as well as secure and efficient processing.

V. RESULTS

Fig.1

Fig.2

Fig.3

VI. CONCLUSION

The proposed hybrid phishing detection model combines machine learning classification with real-time URL scanning through the Urlscan.io API, and provides a scalable, efficient, and accurate solution for identifying phishing websites. By leveraging lexical features and external threat intelligence, the model can detect a wide variety of phishing tactics, making it a powerful tool for enhancing online security.

VII. ACKNOWLEDGEMENT

We express our gratitude for the contributions of the broader research community and the open-source resources that have facilitated this work. This research would not have been possible without the support and encouragement of our peers.

VIII. REFERENCES

1. "Dynamic Malware Classification and API Categorisation of Windows Portable Executable Files Using Machine Learning", Durre Zehra Syeda and Mamoona Naveed Asghar, 25 January 2024.

2. "Detection of malicious URLs using machine learning", Nuria Reyes-Dorta, Pino Caballero-Gil and Carlos Rosa-Remedios, 06 March 2024.

3. "An Identification and Analysis of Harmful URLs through the Application of Machine Learning Techniques", Swagat M. Gavali, Shital Kakad, Swapnaja Amol, Ashwini B. Gavali, Sonali B. Gavali, 23 February 2024.

4. "Malicious URL Detection and Classification Analysis using Machine Learning Models", Upendra Shetty and Anusha Patil, Mohana, January 2023.

5. "A text classification approach to API type resolution for incomplete code snippets", Camilo Velázquez-Rodríguez, Dario Di Nucci, Coen De Roover, April 2023.

6. "URL Classification Based on Lexical Features by Machine Learning", Cing Gel Vung, Yu Yu Win, 19 July 2023.

7. "Classification of Malicious URLs Using Machine Learning", Shayan Abad, Hassan Gholamy, Mohammad Aslani, University of Gävle, September 2023.

8. "Malicious Software Detection based on URL-API Intensity Feature Selection Using Deep Spectral Neural Classification for Improving Host Security", B. Lavanya and C. Shanthi, 2023.

9. "Phishing Website Detection Using Novel Machine Learning Fusion Approach", Arikatla Gopi Venkata Sudheer, Aravapalli Sujith Kumar, March 2022.

10. "Detecting Malicious URLs Using Machine Learning Techniques", Alotaibijabri, Hanan S. Altamimi, Shahd A. Albelali, Maimunah Al-Harbi, Haya T. Alhuraib, Najd K. Alotaibi, 14 November 2022.
