0% found this document useful (0 votes)
13 views13 pages

Few-Shot API Attack Anomaly Detection in a Classification-by-Retrieval Framework

Uploaded by

Bhaskar Banerjee
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views13 pages

Few-Shot API Attack Anomaly Detection in a Classification-by-Retrieval Framework

Uploaded by

Bhaskar Banerjee
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

1

Few-Shot API Attack Anomaly Detection in a


Classification-by-Retrieval Framework
Udi Aharon, Ran Dubin, Amit Dvir, and Chen Hajaj

Abstract—Application Programming Interface (API) attacks adoption of web APIs has heightened the potential of user
refer to the unauthorized or malicious use of APIs, which are safety and privacy breaches, making APIs a prime target for
often exploited to gain access to sensitive data or manipulate on- cyber attackers [3]. In recent years, there have been several
arXiv:2405.11247v1 [cs.CR] 18 May 2024

line systems for illicit purposes. Identifying actors that deceitfully


utilize an API poses a demanding problem. Although there have high-profile API attacks [8]–[11], such as the Zoom video
been notable advancements and contributions in the field of API conferencing platform in 2020.
security, there still remains a significant challenge when dealing
To address these challenges, the Open Web Application
with attackers who use novel approaches that don’t match the
well-known payloads commonly seen in attacks. Also, attackers Security Project (OWASP1 ) provides resources, tools, and best
may exploit standard functionalities in unconventional manners practices to help organizations and developers enhance the
and with objectives surpassing their intended boundaries. This security of their web applications and protect against malicious
means API security needs to be more sophisticated and dynamic attacks. One of the most well-known contributions is the
than ever, with advanced computational intelligence methods,
OWASP API Top 10, which outlines the ten most critical API
such as machine learning models that can quickly identify and
respond to anomalous behavior. In response to these challenges, security risks [12].
we propose a novel few-shot anomaly detection framework, Despite the progress made in prior research on utilizing
named FT-ANN. This framework is composed of two parts:
First, we train a dedicated generic language model for API machine learning and deep learning models for protecting
based on FastText embedding. Next, we use Approximate Nearest APIs against both known and unknown attacks [13]–[15],
Neighbor search in a classification-by-retrieval approach. Our there are lingering concerns that remain unresolved. These
framework enables the development of a lightweight model that concerns encompass several challenges, such as effectively
can be trained with minimal examples per class or even a model detecting zero-day vulnerabilities, minimizing false positives,
capable of classifying multiple classes. The results show that our
framework effectively improves API attack detection accuracy and addressing real-time and continuous protection require-
compared to various baselines. ments. Given that a zero-day API attack involves an unknown
vulnerability that the security solutions, such as web applica-
Index Terms—API Security, Anomaly Detection, Few-Shot
Learning, ANN, Classification-by-Retrieval, NLP. tion firewalls, are unaware of, it becomes necessary to employ
few-shot learning techniques. The anomaly detection model
needs to leverage its comprehension of previously encountered
I. I NTRODUCTION samples to make predictions on new, unseen samples. In light
PPLICATION Programming Interface (API) refers to of the requirement to improve traditional methods for training
A a set of routines, procedures, resources, and protocols
that permit the interaction between software systems and data
binary anomaly classifiers, the utilization of Classification-
by-Retrieval presents a solution [16]. This approach enables
exchange services [1]–[3]. APIs are an evolving technology the construction of neural network-based classifiers without
for orchestrating applications utilizing web technology [4], the need for computationally intensive training procedures.
[5]. Recently, it has been argued that we are currently living Consequently, it facilitates the development of a lightweight
in an API economy [5] due to the growing interconnect- model that can be trained with minimal examples per class or
edness of people, applications, and systems, all of which even a model capable of classifying multiple classes [17]–[19].
are powered by APIs. These interfaces now serve as the The remaining sections of this paper are structured in the
foundational framework of the digital ecosystem, establishing following manner. In Section II, we summarize the contribu-
connections between industries and economies to foster value tions made in this paper, Section III provides an overview
creation and cultivate innovative capabilities [2]. APIs find of the relevant literature. The architecture of the FT-ANN
utilization across a broad spectrum of services, including solution is presented in Section IV. The experimental design
Web applications, Operation Systems (OS), Databases, and is outlined in Section V. Section VI focuses on measuring the
Hardware [6]. The increasing ease of building web appli- effectiveness of FT-ANN and presenting experimental results
cations has led to a rise in agile development, where even in comparison to various state-of-the-art benchmarks. Finally,
inexperienced engineers can deploy applications. However, we draw our conclusions, current research limitations, and
this approach often lacks strong security design or hardening suggestions for future work in Section VII.
planning. This can lead to vulnerabilities in application logic
and inadequate consideration of security impacts. For instance,
failure to properly constrain resources or access levels could
lead to denial of service attacks [7]. As a result, the extensive 1 The Open Web Application Security Project®, https://owasp.org/
2

II. O UR C ONTRIBUTION Gniewkowski et al. [23] suggested an NLP-based semi-


In this paper, we introduce a novel unsupervised few- supervised anomaly detection methodology that employs
shot anomaly detection framework, utilizing FastText embed- RoBERTa for embedding HTTP requests to detect anomalies.
ding and Approximate Nearest Neighbor search (FT-ANN), They evaluated the pipeline over the CSIC 2010 and CSE-CIC-
which leverages a Classification-by-retrieval approach. We IDS 2018 published datasets, in addition to UMP, a custom-
enable training of a single retrieval model capable of han- generated dataset, achieving F1-scores of 96.9%, 92.6%, and
dling multiple baselines simultaneously. This approach not 99.9%, respectively. A notable limitation in their method is
only saves computational resources but also simplifies the the relatively long training and inference time, which presents
model deployment and maintenance processes. By reducing a considerable drawback in real-time systems.
the number of models required, our methodology offers a Jemal et al. [24] propose a supervised Memory CNN that
more efficient and scalable solution for anomaly detection. combines a CNN and Long Short-Term Memory (LSTM) to
Moreover, unlike traditional anomaly detection models, ANN identify patterns of malicious requests within sequences of
supports incremental index updates. requests. They evaluated the model using the CSIC 2010
Additionally, we define a novel tokenizer that specifically dataset. Niu et al. [25] propose a technique for detecting
emphasizes the language factors present in APIs, addressing web attacks based on a supervised CNN and Gated Recurrent
the unique challenges associated with API-based natural lan- Unit (GRU). They extracted statistical features and utilized
guage processing. Unlike existing tokenizers, our approach a Word2Vec model to extract word embeddings, resulting in
takes into account the specific language characteristics of a 3-dimensional input for the suggested CNN-GRU method.
APIs, enabling more accurate and efficient processing of API- A fully connected layer was used for classification, and
related text. APIs are defined by a URL-based syntax in which the experiment was made on the CSIC 2010 dataset. Yu
each URL corresponds to a particular resource or action. They et al. [26] combined a CNN and Support Vector Machine
also include fundamental actions such as GET and POST, (SVM) to detect malicious web server requests. The CSIC
which determine how requests and responses are structured. 2010 dataset was selected to validate proposed approach. Baye
Additionally, APIs utilize standard structure of HTTP headers et al. [14] also utilized SVM with a Linear Kernel as a two-
to transmit metadata pertaining to both the request and the class classifier to identify anomalous API requests. To form
anticipated response. a training dataset that accurately represented authentic API
Furthermore, our language model is designed to be domain- logs, they employed a technique for outlier detection based
agnostic, eliminating the need for retraining when transitioning on Gaussian Distribution, generating a synthetic dataset with
to different API domains. This flexibility allows our model to labeled examples.
Moradi et al. [27] introduced an unsupervised anomaly
seamlessly serve various domains without sacrificing perfor-
detection technique utilizing Auto-Encoder LSTM for feature
mance or requiring additional training efforts.
extraction and Isolation Forest for classification. They applied
Lastly, its agnostic nature allows it to seamlessly adapt
this model to the CSIC 2010 dataset, achieving an F1-score
and address the unique requirements and challenges as also
of 81.96%. Although, they faced limitations related to the
outlined in the OWASP Top 10 API vulnerabilities, posed
choice of encoding method for HTTP data, the non-stationary
by different API forms include REST, GraphQL, gRPC, and
nature of HTTP data, and the integration of multiple feature
WebSockets, regardless of the emphasis on HTTP datasets
sets, which may impact the effectiveness of the proposed
during the demonstration.
approach. In more recent research [28], they also proposed an
unsupervised Deep Support Vector method. They conducted
III. R ELATED W ORK a comparison between two feature extraction approaches,
Reddy et al. [20] proposed supervised sequence models namely bigram (2-gram) and one-hot. They evaluated it on the
based on Recurrent Neural Networks (RNN) to identify ma- CSIC 2010 and ECML/PKDD 2007 datasets, achieving F1-
licious injections in API requests in addition to a heuristic scores of 89% and 79.48%, respectively. They pointed out the
rule which classified 10% of each request sequence with a limitation of lacking support for data streams in incremental
probability of 60% as valid to minimize the number of false learning, emphasizing the need for future research to address
positives. However, such rules can introduce subjective biases this aspect.
and limitations. They generated a custom-labeled dataset, but
used only the request payloads. This presents a significant IV. F RAMEWORK
constraint in real-world scenarios, where malicious actor could The analysis of API requests can be framed as a problem in
potentially manipulate a majority of the request components, NLP. One challenge lies in selecting a language model capable
including headers [21]. They compared six unidirectional and of generating a vector space representation. In our work, we
bidirectional RNN models by evaluating various performance decided to utilize the FastText [29] model. The primary aim
measures and showed a 50% decrease in false positive cases. of FastText embeddings is to factor in the internal structure
Jemal et al. [22] suggested the Convolutional Neural Net- of words rather than simply learning word representations.
work (CNN) method to detect web attacks. They concluded This feature is especially advantageous for morphologically
that appropriately adjusting hyper-parameters and employing complex languages, allowing representations for various mor-
a data pre-processing approach significantly impacts the de- phological forms of words to be learned separately. Fast-
tection rate. Text offers Skip-gram and Continuous Bag-of-Words (CBOW)
3

Ref Attack Method Task Technique(s) Method(s) Dataset(s) Best Score


[20] SQL, XML and JSON at- Supervised RNN Word embedding Custom 98.13% F1-Score
tacks in HTTP request
[22] Attack in HTTP request Supervised CNN Word embedding, Charac- CSIC 2010 97.65% Accuracy
ter embedding
CSIC 2010 96.9% F1-Score
[23] Attack in HTTP request Semi-Supervised RoBERTa BBPE UMP 92.6% F1-Score
CSE-CIC-IDS2018 99.9% F1-Score
[24] Sequences of HTTP requests Supervised CNN, LSTM ASCII embedding CSIC 2010 98.53% F1-Score
attack
[25] Attack in HTTP request Supervised CNN, GRU Word embedding, Statisti- CSIC 2010 98.77% F1-Score
cal features
[26] Attack in HTTP request Supervised TextCNN, SVM Word embedding,t-SNE, CSIC 2010 99.3% F1-Score
Statistical features
[27] Attack in HTTP request Unsupervised Auto-Encoder LSTM Character embedding CSIC 2010 81.96% F1-Score
[28] Attack in HTTP request Unsupervised SVDD Character embedding CSIC 2010 89% F1-Score

TABLE I: Summary of Machine Learning NLP-based Techniques

models to compute word representations. Although CBOW API applications where rapid response times are essential.
learns faster than Skip-gram, Skip-gram outperforms CBOW Additionally, ANN methods are adaptable to high-dimensional
on small datasets [30]. FastText adopts a character n-gram spaces [34], [37], where traditional NN searches can suffer
approach to tokenize words, which effectively tackles Out-of- from the ”curse of dimensionality”, in which the computational
vocabulary (OOV) problems. This method not only generates requirements for exact NN search become prohibitively high
embeddings for common words but also for rare, misspelled, as the dimensionality of the dataset increases [36].
or previously unseen words in the training corpus. In the The framework proposed (depicted in Figure 1) comprises a
realm of API security, it is imperative to pay attention to the combination of a FastText embedding network and a retrieval
internal structure of words in order to grasp the intent and layer, which includes an ANN matching component and a
context behind API requests. API security entails scrutinizing result aggregation component built upon it, forming the FT-
the textual content within API requests to pinpoint potential ANN system. In Phase 1 (Figure 1), we initiate the process
threats or vulnerabilities [31]. These security threats could be by training a generic language model to serve as a reliable
concealed within apparently harmless text, which may involve baseline for the detection model. To gather a substantial
the use of uncommon or previously unseen terms, or even dataset, we conducted web crawling on a random sample of
attempts to obscure their intent through spelling errors [32]. websites from the Tranco top websites list, collecting over a
Moreover, APIs might be required to accommodate a diverse million examples of normal web traffic. This language model
range of languages, including those characterized by intricate now serves as a pre-trained baseline, applicable to any real-
morphological structures [33]. world API anomaly detection system.
Prior works have proposed various methods for detecting Proceeding to Phase 2 (Figure 1), the unsupervised detection
anomalies. While the majority of these approaches focus on model is trained, consisting of a pre-processing step and a
classic classification tasks, we propose the utilization of the Classification-by-retrieval framework. The pre-processing step
ANN vector similarity method for identifying in-distribution validates the HTTP headers against the standard structure. Ad-
records. The similarity search concept pertains to the pro- ditionally, a unique data transformation is applied to simplify
cess of identifying data points in a dataset that demonstrate the API vocabulary. The detection model is trained solely on
similarities with a specific pattern, commonly referred to as normal API traffic, and constructs a single ANN model for all
a query [34]. To measure the similarity of a pair of data endpoints. The term ”endpoint” refers to any data or metadata
points, a distance function is used, where a small distance that may represent a type, an origin, or identification of a
indicates that the two points are more similar or ”closer” to respective API request. Endpoint is defined as the combination
each other [35]. The NN search is a specific type of similarity of the method, host and path [38]. For each endpoint, a
search used to identify data points that are nearest in distance threshold is calculated and utilized during the detection stage.
to a provided query point [36]. The ANN allows search despite
Finally, in Phase 3 (Figure 1), the model’s performance is
the possibility of not retrieving all neighbors in a metric
validated by allowing an index search for every request. Dur-
space [36].
ing validation, we standardize the request using the same pre-
While both ANN and traditional NN are rooted in the fun- processing stages applied during the detection model training
damental concept of similarity search, traditional NN search phase. The text is then transformed into a vector represen-
involves an exhaustive examination of all data points to find tation using our generic language model through inference.
the NN, which can be prohibitively time-consuming for large These embedding vectors are used in the ANN search, which
datasets [36]. In contrast, ANN employs techniques that trade returns the K-NNs from the specific endpoint index collection,
off a slight loss in precision for substantial gains in speed, en- followed by a maximum distance scaling layer. The final score
abling the identification of ANN without examining every data obtained enables the model to ascertain whether the incoming
point. This efficiency becomes especially crucial in real-world API request is normal or an anomaly by comparing with the
4

Fig. 1: Our FT-ANN Framework

pre-defined threshold for each API endpoint. enabling it to describe various transactions in a consistent
format without sacrificing their original meanings. In fact, in
A. Data Pre-Processing the majority of cases, these requests have been intelligently
For both training and inferring the detection model, we merged into a single text representation. This consolidation
employ a unique pre-processing technique consisting of three not only streamlines the data but also ensures that the es-
phases, as depicted in Figure 1 in steps 2.1 and 3.1. First, sential information pertaining to different transactions remains
decoding URL special characters, decompressing request body preserved.
content, and converting every character within the request
data string into lowercase. Then, validating request headers
to ensure they are formatted correctly and extracting the
endpoint definition, as depicted in Figure 1 2.1.1 and 3.1.1.
API request headers typically provide information about the
request context, supply authentication credentials, and provide
information about the client (e.g., a person, a computing
device, and/or a browser application) that had initiated the API
[39]. API request header fields are typically derived from a
limited set of options. Accordingly, the data preparation also
uses a fixed set of rules to validate the content of request
headers and filter-out headers according to these rules. Headers
that include valid or approved strings may be transferred to
their destination as an API request and may be excluded
from additional processing. For example, request headers may
include host strings, which specify host or Internet Protocol
(IP) addresses and/or port numbers of a server to which the
API request is being sent. Valid IPv4 syntax should be in the
format of: (0 ≤ n < 256).(0 ≤ n < 256).(0 ≤ n < 256).(0 ≤
n < 256) : (1 ≤ n ≤ 65535). Fig. 2: Data pre-processing steps
Lastly, as shown in Figure 1 2.1.2 and 3.1.2, we con-
vert received requests into abstracted versions based on a
conversion schema. For instance, we replace non-numeric
single characters with the string ”chr”, which serves as a B. Unsupervised ML Language Model
representative, abstract version of the original request string. Our framework leverages FastText for unsupervised learn-
Another example involves converting non-textual symbols into ing. FastText possesses the capability to encapsulate substan-
predefined textual strings. for instance, colons (”:”) may be tive knowledge about words while integrating morphology
converted to the string ”colon”. During the pre-processing details, a crucial aspect for API attack detection. While deep
stage, the API language has been refined to achieve optimally, learning models have excelled state-of-the-art results across
5

various NLP tasks, to the best of our knowledge, no previous with increments of 0.1. The model considers the balance
NLP pre-trained model on the API traffic domain has been between precision and recall, as captured by the F1 score,
publicly published. In the training phase of the method, we to determine the optimal threshold value. As described in
built a single generic FastText language model (based on [29]) Figure 3, in the detection stage, the first position score with
from scratch using the normal API traffic collected by crawling the max normalized value is compared to the best threshold
Tranco’s list of the most popular websites, as can be seen in to determine anomaly.
Figure 1 Phase 1. For training the generic language model, we
used the default hyper-parameters of [29], which encompass
a learning rate of 5%, a word vector size of 100, and a context
window size of 5. The model was then trained for 5 epochs.
We utilized the CBOW model as the dataset is relatively
large, and CBOW embeddings are precise enough for anomaly
detection and computed in a shorter time than skip-gram [40].
For training the detection model and inferencing, we extract
the vector representation of words for every input line.

C. ANN
We obtained ANN to identify normal representation of an Fig. 3: Features extraction and classification-by-retrieval
API endpoint. We train a single detection model, which is
used to describe a normal representation of all endpoints.
During the detection model training stage, each API request V. E XPERIMENTAL D ESIGN
is represented as a vector in the textual embedding space, API security papers are not as prevalent as the technology
including endpoint information. We employ cosine distance itself, despite being one of the most influential technologies
to measure the similarity between data points as it has been [46]. This is particularly evident in the scarcity of ready-to-
applied in numerous text mining endeavors, such as text use publicly available API datasets [47]. Several of these
classification, and information retrieval [41]. Additionally, datasets are obsolete and unsuitable for usage, with some
it has been proven to be effective for Out-of-Distribution lacking traffic diversity and volumes, and others failing to
(OOD) detection tasks [42], [43]. Cosine similarity is a encompass a variety of attacks, such as ECML/PKDD 2007
widely used measure of similarity that calculates the angle [48] and CSE-CIC-IDS 2018 [49].Consequently, researchers
formed by a pair of vectors. When measuring the similarity resort to creating customized datasets by primarily utilizing
between two patterns, the Euclidean distance increases as they open-source vulnerable web applications like DVWA, BWAPP,
become less similar, while the cosine similarity increases as and Mutillidae, and employ automated penetration tools such
they become more similar. Unlike Euclidean distance, cosine as SQLMAP [50], SQLNINJA [51], and OWASP ZAP [52]
similarity is unaffected by the magnitude of the vectors being to gather malicious payloads [53]. Therefore, this research
compared [44]. The embedding vector feeds the Hierarchical is evaluated on two datasets: CSIC 2010 [54] and ATRDF
Navigable Small Worlds (HNSW) graph [45] to build new 2023 [55]. The HTTP CSIC 2010 dataset [54] is widely
indexes of data points. During the detection stage, the model used [23]–[27], [33], [56]–[62] in the field of malicious
compares new API vectors of incoming API requests in web traffic detection. This dataset was created by the Spanish
relation to API vectors of the same API endpoint information Research National Council (CSIC). It is a sample of the traffic
to evaluate normality or anomaly of the incoming API request. occurring on the Spanish e-commerce web application. The
The comparison is made by querying the similarity between dataset includes attacks such as SQL injection (Figure 4a),
the input vector to the closest k objects. buffer overflow, information gathering, files disclosure, CRLF
The ANN search returns a set of IDs representing neighbors’ Injection, XSS, static attacks and unintentional illegal requests.
points and the similarity score between the given point and While unintentional illegal requests lack malicious intent, they
its ID. Max distance scaling is employed to scale the ANN deviate from the typical behavior of the web application and
similarity score within the given range. For every score, the exhibit a different structure compared to regular parameter
maximum value gets transformed into a 0, and every other values. For instance, as shown in Figure 4b, an invalid DNI
value is divided by the maximum similarity score in the range (Spanish national ID number) was marked as an anomaly. We
and then subtracted from 1. We use this method to invert the divided the dataset into two segments: the training portion,
relationship between the original and the normalized scores which included 36,000 normal requests and was exclusively
to emphasize higher scores for smaller values. As in the utilized for representation learning, and the inference portion,
cosine space, a smaller distance indicates that the two points which comprised both 36,000 normal and 25,000 anomalous
are closer to each other. Let X be  the similarity
 score, the traffic that was encoded by the model and employed for
′ X
normalized score X’ is: X = 1 − max(X) detection. It has 38 different endpoints, 8 of which have no
Lastly, we suggest an adaptive search for the best threshold normal representation and were excluded from our experiment.
for each API endpoint. As part of the detection model training, The API Traffic Research Dataset Framework (ATRDF) [55]
the model iteratively evaluates thresholds between 0 and 1, is a recently published HTTP dataset publicly available which
6

includes 18 different API endpoints. The dataset includes


attacks such as Directory Traversal, Cookie Injection, Log4j
(Figure 4c), RCE, Log Forging, SQLi, and XSS. The dataset
contains 54,0000 normal and 78,000 abnormal sets of request
and response.
In response to the unavailability of a publicly accessible
pre-trained model specialized for the API domain, a generic
language model was developed to establish a reliable baseline
for various detection models. However, developing a robust (a) CSIC 2010 request sample
language model necessitates a significant amount of training
data. To address this requirement, an extensive dataset of
1,061,095 API examples was collected. This dataset was ob-
tained by performing a comprehensive data collection process,
involving the crawling of a random sample of websites from
the Tranco top websites list2 . The Tranco list is an invaluable
resource for cyber-security research as it provides a publicly
available compilation of the top one million most popular do-
mains, ranked based on a combination of four reputable lists:
Alexa, Cisco Umbrella, Majestic, and Quantcast. Additionally,
(b) CSIC 2010 unintentional illegal request
the Tranco list offers the advantage of being able to filter out
unavailable or malicious domains, making it a valuable asset
for our research [63].

VI. E VALUATION M ETRICS


In the realm of anomaly detection, it is commonplace to
integrate a binary classification layer into the model architec-
ture. This is due to the fundamental objective of distinguishing
between normal and abnormal instances. To evaluate our
architecture, we suggest using several performance measures (c) ATRDF 2023 request sample
in various experiments, including precision, recall, accuracy
Fig. 4: An example of anomaly requests from CSIC 2010 and
and F1-score. Actual values are represented as True and
ATRDF 2023 datasets where endpoint definition is marked in
False by (1) and (0), respectively and predicted as Positive
green, abnormal payload is marked in red.
and Negative values by (1) and (0), respectively. Predicted
possibilities of classification models are obtained through the
expressions TP, TN, FP, and FN.
F1-score: measures the accuracy of the instances that were
Precision: evaluates the accuracy of the positive predictions
classified incorrectly by a model. It is obtained by taking the
using the ratio of correctly predicted positive instances out of
harmonic mean of precision and recall.
all instances predicted as positive. It is obtained by dividing
the number of true positives by the sum of true positives and P recision × Recall
false positives. F 1-score = 2 × (4)
P recision + Recall
TP In the current problem, two classes are represented by
P recision = (1)
TP + FP Positive and Negative, where the positive class corresponds to
Recall: measures the proportion of actual positives identi- an abnormal API request and the negative class corresponds
fied correctly. It is obtained by dividing the number of true to a normal API request.
positives by the sum of true positives and false negatives.
VII. E XPERIMENTAL R ESULTS
TP
Recall = (2) To understand better the two datasets, we use the t-
TP + FN
Distributed Stochastic Neighbor Embedding (T-SNE) method
Accuracy: measures the overall correctness of the model
for dimensionality reduction to graphically depict our high-
predictions. It is obtained by dividing the total number of
dimensional datasets. The plots in Figure 5 indicate distinct
correct predictions by the total number of predictions made
separation among classes within a reduced dimensional space.
by the model.
This suggests that requests with similarities tend to be grouped
TP + TN together, enabling the exploration of a neighborhood for any
Accuracy = (3) given sample. Close records from different classes suggest that
TP + TN + FP + FN
certain requests are quite similar, causing the embedding to
2 https://tranco-list.eu/ overlap between the two classes. The result for ATRDF 2023
7

shows no overlap and clear separation between the classes tively contrast our framework with other nearest neighbor
while for CSIC 2010, some records from the two classes were algorithms. This benchmark involved generating algorithm
found to be similar. We identified that most of the anomalous instances based on configuration file written in YAML format
requests which overlap with normal requests actually have no that defines the different methods and algorithms. At the top
malicious payload and are categorized as unintentional illegal level, the point type is specified, followed by the distance
requests. metric, and finally, each algorithm implementation to be eval-
uated. Each implementation specifies the Python library, and
additional entries provide the necessary arguments. For clarity,
an illustrative example of this configuration file is presented
in Figure 6, while the complete file is available in the project
GitHub repository 3 . Both the ”space” and ”run groups” lists
encompass arguments that should be included at the beginning
of every invocation. Each algorithm defines one or more
”run groups,” each of which is expanded into several lists of
arguments. The Cartesian product of these entries results in
numerous argument lists. For instance, consider the hnswlib
entry depicted in Figure 6. This expands into three distinct
algorithm instances: Cosine (Cosine Similarity), L2 (Squared
L2), and IP (Inner product). Each of these instances under-
goes training before being utilized for various experiments.
Initially, experiments are conducted with different values of k,
representing the number of neighbors to return (e.g., [10, 50,
(a) CSIC 2010 100, 300, 400, 500, 1000, 2000, 2500, 3000]). Subsequently,
experiments are conducted with varying ef construction val-
ues, which denote the size of the dynamic list used during
index construction. A larger ef construction value indicates a
higher quality index but also results in longer build times (e.g.,
[10, 20, 40, 80, 120, 200, 400, 600, 800]). Throughout each
run, pertinent information is recorded, including the algorithm
name, the time taken to construct the data structure used
for indexing, and the outcomes of every query. These query
outcomes encompass the neighboring points returned by the
algorithm, the duration required to locate these neighbors, and
the proximity between the neighbors and the query point.
We leveraged the query results to compute the correspond-
ing confusion matrix, enabling us to thoroughly evaluate
the classification-by-retrieval performance of each algorithm.
We conducted our benchmarking analysis by assessing the
performance of various algorithm implementations, including
(b) ATRDF 2023
Nmslib, Hnswlib, Bruteforce Blas, Balltree, KDtree, CKDtree,
Fig. 5: Vector representations reduced to 2D with t-SNE Annoy, Faiss, and RPForest, all of which were evaluated using
the publicly accessible ANN-Benchmarks tool as a reference
Then, we compare our method with fourteen detecting out- framework [65]. In order to gain a deeper comprehension of
lying objects in multivariate data baseline models [64]: Feature how the embedding layer influences the detection outcome,
Bagging (FB), Histogram-based Outlier Detection (HBOS), we assessed all anomaly detection baseline models and the
Isolation Forest (IF), Local Outlier Factor (LOF), Minimum ANN benchmark using two additional prominent language
Covariance Determinant (MCD), One-class SVM (OCSVM), models: BERT [66], and RoBERTa [67]. Each model was
Principal Component Analysis (PCA), Copula-Based Outlier individually trained from scratch and subsequently subjected
Detector (COPOD), Deep One-Class Classification for outlier to evaluation. We employed the RoBERTa model, specifically
detection (DeepSVDD), Clustering-Based Local Outlier Fac- the RoBERTaForMaskedLM class, with a language modeling
tor (CBLOF), Outlier detection based on Gaussian Mixture head on top4 . This model was trained with a maximum
Model (GMM), Kernel Density Estimation (KDE), Linear sequence length of 512, utilizing 12 hidden layers and 12
Model Deviation-based Outlier Detection (LMDD), Quasi- attention heads. The training process spanned 10 epochs,
Monte Carlo Discrepancy Outlier Detection (QMCD). We employing a batch size of 16, which aligns with a similar
apply default hyper-parameters as provided by the original
3 https://github.com/ArielCyber/FT-ANN-Journal/blob/main/ann
source code across all models for consistency and use the
benchmark.yaml
same pre-processed dataset to facilitate optimal comparison. 4 RoBERTa implementation by Hugging Face - https://huggingface.co/docs/
Furthermore, we conducted an ANN benchmark to effec- transformers/model doc/RoBERTa
8

approach in a prior study by [23]. Likewise, we employed in terms of precision and recall, NMSLIB-COS again out-
the BERT model, specifically BertForMaskedLM, which also performs other algorithms with scores of 0.9538 and 0.9954
incorporates a language modeling head on top5 . This model respectively. LMDD appears to exhibit the comparatively
underwent training with a maximum sequence length of 512, weakest performance across multiple metrics. With an F1
utilizing 4 hidden layers and 4 attention heads, as was the case score of 0.3017, a precision of 0.528, and a recall of 0.2293,
in a similar task outlined in [68]. LMDD lags behind the other algorithms. Furthermore, its
training time of 225.8 seconds is considerably longer than
- name: hnswlib most other methods. The poor performance and unfavorable
library: hnswlib tradeoff between training time and results for LMDD could
method: [hnswlib] be attributed to its reliance on a dissimilarity function that
space: [cosine,l2,ip] may not be well-suited to the complex and diverse anomalies
run_groups: present [69].
K: Several observations stand out when considering the frame-
query_args: [[10,50,100,300,400, work performance using BERT and RoBERTa embedding
500,1000,2000,2500,3000]] compared to FastText. It’s noticeable that BERT and RoBERTa
ef_construction: introduce longer training times for all models. For instance,
query_args: [[10, 20, 40, 80, NMSLIB-COS with BERT takes around 0.446 seconds, while
120, 200, 400, 600, 800]] with RoBERTa, it’s approximately 0.563 seconds, compared
Fig. 6: Example configuration for the hnswlib algorithm to the original 0.0552 seconds with FastText. This increase in
training time could be attributed to the more computational
We used Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz complexity nature of BERT [70], [71] and RoBERTa [72]
to evaluate the effectiveness of each technique. Each end- models. In terms of performance, when considering the BERT
point was evaluated separately, and the average score for all embedding, NMSLIB-COS framework continues to exhibit
endpoints was used to measure overall detector performance. robust performance, achieving an F1-score of 0.9675. This
We evaluate model performance by measuring both execution result suggests that the framework effectively leverages the
time and F1-score during training and testing. Training time contextual information embedded within BERT’s representa-
is usually longer than testing time as it entails parameter tions to identify anomalies. The high precision (0.947) and re-
optimization, a computationally intensive task. Conversely, call (0.9982) values further support the framework’s ability to
testing time is relatively faster as it only involves applying the maintain a fine balance between detecting true anomalies and
pre-trained model to new data and predicting their likelihood minimizing false positives. However, a noteworthy observation
of being outliers. It is worth noting that detecting anomalies in lies in the performance of NMSLIB-COS framework when
real-time is critical in mitigating the effects of an attack. There- using RoBERTa embeddings. Surprisingly, while RoBERTa
fore, it is advisable to analyze each phase separately to better is considered an even more advanced and powerful language
understand each model’s performance. Additionally, we aim to model compared to BERT, the F1-score for NMSLIB-COS
achieve better results compared to previous semi-supervised drops slightly to 0.9664. This outcome raises questions about
and unsupervised studies. Supervised studies should not be why the transition to RoBERTa, which typically exhibits supe-
compared since they rely on labeled data. rior performance across range of natural language processing
In the context of ANN search, selecting the optimal value tasks [67], [73], [74], did not lead to an improved perfor-
for K involves determining the number of nearest neighbors mance for this specific outlier detection method. Generally,
to consider for predictions. The model performance evaluation both BERT and RoBERTa maintain performance comparable
and the choice of the optimized value of K rely on the F1 with FastText with some exceptions. NMSLIB-COS with
score, which provides a balanced assessment of precision and BERT achieves an F1-score of 0.9675, a slight improvement
recall. By systematically iterating through a range of K values over FastText’s 0.9713. Similarly, GMM, MCD, CBLOF, and
and assessing the F1 score for each value, we can identify the IFOREST also exhibit consistent or improved F1-scores with
K value that yields the highest F1 score and use it in our final BERT and RoBERTa. However, for DEEPSVDD, the F1-
endpoint evaluation. During this experiment, the performance scores drop slightly with BERT to 0.9187 and even more
of the model was validated by iteratively testing different with RoBERTa to 0.8999. Precision and recall also showcase
values of K, ranging from 1 to 1000. similar trends.
The results for the CSIC 2010 dataset, as can be seen from The ANN baseline models generally perform better than
Table II, comparing against the conventional outlier detection the traditional outlier detection methods, and their perfor-
baselines, NMSLIB-COS (FT-ANN) and PCA demonstrate the mance is quite similar with only slight differences. Among
shortest times, with 0.0552 and 0.0575 seconds respectively, these models, HNSWLIB-IP stands out with the highest F1
whereas AUTOENCODER and LMDD show significantly score (0.9765) and perfect recall (100%), which means it
longer train times of 75.0644 and 225.8 seconds respectively. doesn’t miss any actual anomalies. When we balance train-
NMSLIB-COS achieves the highest F1 score of 0.9713, in- ing time, precision, and recall, the BRUTEFORCE-BLAS
dicating its balanced precision-recall performance. Similarly, models achieve the best precision and recall while needing
5 BERT implementation by Hugging Face - https://huggingface.co/docs/ less training time. This makes them suitable for scenarios
transformers/model doc/bert where quick responses or limited resources are important.
9

On the other hand, models like HNSWLIB-IP and ANNOY- BERT and RoBERTa. The comprehensive mean benchmark
MANHATTAN have slightly longer training times but achieve outcomes of the diverse distance metrics and algorithm param-
slightly higher F1 scores. The relatively slower performance eters are presented in Figure 8. FastText and CKDTREE man-
of the MINKOWSKI distance metric in the BRUTEFORCE- aged to achieve the briefest average index construction times
BLAS model might be due to the more complicated cal- (0.00007 seconds), whereas BALLTREE functioned more than
culations involved in the MINKOWSKI distance, especially twice as slowly even though it utilized the same embeddings.
when dealing with high-dimensional data [75]. The ANNOY Generally, all the algorithms exhibited rapid performance,
model takes a different approach to constructing its search with the exception of ANNOY across all language models.
structure, which might explain its slightly slower training As previously mentioned, all ANN algorithms demonstrated
times compared to other models. ANNOY’s method involves outstanding predictive accuracy.
generating random projections and building binary trees for
its search index, which requires significant computation to
ensure effective projection dimensions that preserve data re-
lationships [76]. Despite not having the highest F1 score, VIII. D ISCUSSION AND C ONCLUSIONS
NMSLIB finds use in industries like Amazon Elasticsearch
Service [77]. This indicates that although its accuracy might This paper suggests an innovative unsupervised few-shot
not be the absolute best compared to other ANN imple- anomaly detection framework that leverages a dedicated
mentations, its simple deployment and easy integration align generic language model for API based on FastText embedding
well with the requirements of real-world applications. The and uses ANN search in a Classification-by-retrieval approach.
overall performance of the algorithm can be evaluated through We showed that API attacks could be easily identified with no
the average outcomes of difference distance metrics and the previous learning. To the best of our knowledge, this is the
algorithm parameters. As depicted in Figure 7, each algorithm first work to utilize a Classification-by-retrieval framework
utilizing FastText embeddings exhibited significantly swifter based on the generalized approach of FastText embeddings
processing times compared to BERT and RoBERTa. FastText combined with the approximate search to find anomalies in
and CKDTREE attained the shortest average index build times API traffic. We present a unique pre-processing technique
(0.00584 seconds), whereas BALLTREE operated over 2.5 to enhance input generalization and simplify API structure.
times slower despite employing the same embeddings. Within This approach encompasses dividing input data into individual
the top 8 fastest FastText algorithms, only BRUTEFORCE tokens, then constraining the vocabulary to a limited set of
demonstrated favorable performance when applied to BERT tokens. Consequently, the API structure becomes streamlined
and RoBERTa, achieving scores of 0.02745 and 0.03215, as the number of unique tokens diminishes, enabling input
respectively. Conversely, ANNOY’s performance was subpar generalization and enables high detection accuracy even with
for each language model. Regarding model accuracy, as il- minimal examples per class. We presented several state-of-the-
lustrated in Figure 7, every algorithm displayed enhanced art models for this task, performed a comparative analysis and
results when utilizing FastText, achieving F1-scores exceeding demonstrated the best accuracy on the CSIC 2010 and ATRDF
97%. Similarly, the remaining algorithms also exhibited strong 2023 datasets. We showcased multiple cutting-edge models
performance with F1-scores surpassing 96%. Generally, our for this objective through two other widely adopted language
framework achieved better accuracy compared to previous models BERT and RoBERTa. Our comprehensive analysis
unsupervised and semi-supervised studies using the CSIC encompassed benchmarking various ANN search algorithms,
2010 dataset, as shown in Table I. where we illustrated our models’ exceptional accuracy on the
The results for the ATRDF 2023 dataset, as can be seen from CSIC 2010 and ATRDF 2023 datasets. While we noted that
Table III, show that we observed a perfect classifier for most our proposed dense cosine distance approach utilizing NM-
of ANN algorithms. With 100% F1-score, recall and preci- SLIB did not yield the top F1 score among ANN algorithms,
sion, our framework exceeded the traditional outlier detection its straightforward implementation and seamless integration
baseline requirements. Most of the models demonstrate the make it a suitable choice for practical applications, particularly
shortest training/building times, whereas AUTOENCODER, those like Amazon Elasticsearch Service that prioritize real-
MCD and DEEPSVDD show significantly longer train times world compatibility. One notable limitation of this and similar
of average 5.3368, 109.6675 and 3.3173 seconds respectively. studies arises from the scarcity of up-to-date and representative
While NMSLIB-COS did not attain the briefest build time, network traffic datasets. Many research efforts rely solely on
it can still be regarded as relatively rapid. It is evident that the CSIC 2010 dataset, raising concerns about the reliabil-
BERT and RoBERTa result in extended training durations ity and applicability of such data, consequently impacting
for the majority of models, with a few exceptions. LMDD, the quality and generalizability of traffic models. While our
CBLOF, QMCD, HBOS, and AUTOENCODER exhibited evaluation focuses on measuring the impact of FT-ANN on
comparatively swifter training completion when utilized with the HTTP dataset, it is important to acknowledge that APIs
BERT and RoBERTa. Most of the traditional outlier detection exist in diverse forms, including REST, GraphQL, gRPC, and
baselines also performed well but struggled to predict all WebSockets, each catering to specific use cases and carrying
positive classes, which affected the precision score. LMDD common vulnerabilities. Our future work entails expanding the
obtained an average 49.41% in F1-Score, which is much worse evaluation to encompass a wider range of API types and to
than those tested. QMCD failed to predict when tested with assess the performance across various loads and volumes.
10

Train Time (Sec) F1-Score Precision Recall


BERT Fast Ro BERT Fast Ro BERT Fast Ro BERT Fast Ro
Text BERTa Text BERTa Text BERTa Text BERTa
HNSWLIB-IP 4.4549 0.0978 3.6719 0.9443 0.9766 0.7548 0.8890 0.9547 0.6062 1.0000 1.0000 1.0000
ANNOY-MANHATTAN 11.426 3.4422 11.924 0.9690 0.9728 0.9683 0.9507 0.9576 0.9496 0.9974 0.9941 0.9973
FAISS 0.3592 0.0473 0.9377 0.9682 0.9725 0.9680 0.9493 0.9571 0.9491 0.9974 0.9940 0.9973
BRUTEFORCE-BLAS-COSINE 0.0191 0.0005 0.0179 0.9682 0.9725 0.9680 0.9493 0.9571 0.9491 0.9974 0.9941 0.9972
BRUTEFORCE-BLAS-CITYBLOCK 0.0132 0.0004 0.0122 0.9682 0.9724 0.9680 0.9493 0.9570 0.9491 0.9974 0.9941 0.9973
BRUTEFORCE-BLAS-MANHATTAN 0.0176 0.0004 0.0165 0.9682 0.9724 0.9680 0.9493 0.9570 0.9491 0.9974 0.9941 0.9973
KDTREE-CITYBLOCK 1.5269 0.0236 1.4969 0.9682 0.9724 0.9680 0.9493 0.9570 0.9491 0.9974 0.9941 0.9973
KDTREE-MANHATTAN 1.5325 0.0285 1.4807 0.9682 0.9724 0.9680 0.9493 0.9570 0.9491 0.9974 0.9941 0.9973
BALLTREE-CITYBLOCK 1.2906 0.0145 1.3317 0.9682 0.9724 0.9680 0.9493 0.9570 0.9491 0.9974 0.9941 0.9973
BALLTREE-MANHATTAN 1.2704 0.0141 1.3043 0.9682 0.9724 0.9680 0.9493 0.9570 0.9491 0.9974 0.9941 0.9973
CKDTREE 0.3944 0.0058 0.4961 0.9682 0.9723 0.9681 0.9493 0.9567 0.9492 0.9974 0.9942 0.9973
KDTREE-EUCLIDEAN 1.5244 0.0214 1.5159 0.9682 0.9723 0.9680 0.9493 0.9567 0.9491 0.9974 0.9942 0.9973
KDTREE-L2 1.5497 0.0250 1.5180 0.9682 0.9723 0.9680 0.9493 0.9567 0.9491 0.9974 0.9942 0.9973
KDTREE-MINKOWSKI 1.5287 0.0347 1.5336 0.9682 0.9723 0.9680 0.9493 0.9567 0.9491 0.9974 0.9942 0.9973
BALLTREE-EUCLIDEAN 1.2903 0.0154 1.2902 0.9682 0.9723 0.9680 0.9493 0.9567 0.9491 0.9974 0.9942 0.9973
BALLTREE-L2 1.2916 0.0153 1.2997 0.9682 0.9723 0.9680 0.9493 0.9567 0.9491 0.9974 0.9942 0.9973
BALLTREE-MINKOWSKI 1.2732 0.0154 1.3031 0.9682 0.9723 0.9680 0.9493 0.9567 0.9491 0.9974 0.9942 0.9973
BRUTEFORCE-BLAS-EUCLIDEAN 0.0108 0.0005 0.0116 0.9682 0.9722 0.9680 0.9493 0.9566 0.9491 0.9974 0.9942 0.9973
BRUTEFORCE-BLAS-MINKOWSKI 0.0765 0.1127 0.1025 0.9682 0.9722 0.9680 0.9493 0.9566 0.9491 0.9974 0.9942 0.9973
NMSLIB-L2 0.9779 0.0473 1.0642 0.9680 0.9721 0.9674 0.9491 0.9568 0.9482 0.9973 0.9936 0.9972
NMSLIB-L1 0.9580 0.0489 1.1008 0.9680 0.9721 0.9676 0.9491 0.9567 0.9488 0.9973 0.9937 0.9969
ANNOY-ANGULAR 13.711 4.1185 13.759 0.9672 0.9720 0.9684 0.9477 0.9564 0.9497 0.9973 0.9941 0.9973
ANNOY-EUCLIDEAN 12.177 4.5989 13.148 0.9680 0.9720 0.9686 0.9491 0.9563 0.9501 0.9974 0.9942 0.9973
HNSWLIB-L2 6.5040 0.6606 6.8610 0.9673 0.9719 0.9659 0.9479 0.9562 0.9458 0.9973 0.9939 0.9970
NMSLIB-COS (FT-ANN) 0.4458 0.0552 0.5626 0.9675 0.9713 0.9664 0.9470 0.9538 0.9465 0.9982 0.9954 0.9972
GMM 3.3663 0.1052 3.3310 0.9380 0.9651 0.9686 0.9290 0.9395 0.9476 0.9494 0.9931 0.9948
HNSWLIB-COSINE 6.4246 0.4612 7.2954 0.9462 0.9642 0.9433 0.9079 0.9414 0.9023 0.9994 0.9978 1.0000
MCD 1269.3 6.3603 1261.3 0.9622 0.9572 0.9517 0.9377 0.9382 0.9159 0.9971 0.9774 0.9985
CBLOF 2.8297 0.5747 2.7095 0.9404 0.9445 0.9365 0.9297 0.9318 0.9276 0.9531 0.9589 0.9477
IFOREST 16.066 0.5734 15.980 0.9397 0.9419 0.9362 0.9289 0.9323 0.9288 0.9525 0.9538 0.9464
HBOS 0.9967 0.8365 1.0126 0.9383 0.9393 0.9367 0.9289 0.9311 0.9274 0.9501 0.9502 0.9483
PCA 27.829 0.0575 28.413 0.9380 0.9391 0.9349 0.9290 0.9304 0.9271 0.9494 0.9504 0.9454
AUTOENCODER 82.873 75.064 84.402 0.9380 0.9389 0.9349 0.9290 0.9303 0.9271 0.9494 0.9502 0.9454
KDE 86.801 3.6710 83.525 0.9370 0.9361 0.9338 0.9283 0.9296 0.9263 0.9483 0.9460 0.9441
OCSVM 45.391 1.6392 50.025 0.9368 0.9360 0.9326 0.9283 0.9295 0.9255 0.9480 0.9459 0.9426
FB 10.875 10.547 10.775 0.9355 0.9354 0.9319 0.9279 0.9303 0.9255 0.9460 0.9444 0.9415
LOF 1.1774 1.0486 1.1147 0.9354 0.9350 0.9317 0.9279 0.9301 0.9255 0.9459 0.9439 0.9412
DEEPSVDD 38.214 32.719 38.242 0.9187 0.9235 0.8999 0.9230 0.9381 0.9155 0.9210 0.9126 0.8954
QMCD 9.9598 0.7783 10.251 0.0000 0.3500 0.0000 0.0000 0.8737 0.0000 0.0000 0.2259 0.0000
LMDD 11883 225.87 11230 0.3292 0.3017 0.3101 0.5701 0.5280 0.5885 0.2476 0.2293 0.2250

TABLE II: Performance comparisons of our framework (FT-ANN) on the CSIC 2010 dataset

Fig. 7: Model level ANN benchmark on the CSIC 2010 dataset


11

Train time (Sec.) F1-score Precision Recall


BERT FAST Ro BERT FAST Ro BERT FAST Ro BERT FAST Ro
TEXT BERTa TEXT BERTa TEXT BERTa TEXT BERTa
ANNOY-ANGULAR 0.9859 0.0908 1.0065 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
KDTREE-CITYBLOCK 0.0007 0.0001 0.0007 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
NMSLIB-L2 0.0026 0.0020 0.0025 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
KDTREE-EUCLIDEAN 0.0008 0.0002 0.0008 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
ANNOY-EUCLIDEAN 0.9808 0.0917 1.0144 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
KDTREE-L2 0.0008 0.0002 0.0008 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
KDTREE-MANHATTAN 0.0007 0.0001 0.0007 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
FAISS 0.0015 0.0010 0.0004 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
NMSLIB-COS (FT-ANN) 0.0039 0.0029 0.0042 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
CKDTREE 0.0005 0.0001 0.0005 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
KDTREE-MINKOWSKI 0.0007 0.0002 0.0008 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BRUTEFORCE-BLAS-MINKOWSKI 0.0087 0.0028 0.0003 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BRUTEFORCE-BLAS-MANHATTAN 0.0002 0.0001 0.0002 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BRUTEFORCE-BLAS-EUCLIDEAN 0.0003 0.0001 0.0002 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BRUTEFORCE-BLAS-COSINE 0.0002 0.0001 0.0002 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BRUTEFORCE-BLAS-CITYBLOCK 0.0002 0.0001 0.0002 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BALLTREE-MINKOWSKI 0.0005 0.0002 0.0005 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BALLTREE-MANHATTAN 0.0005 0.0001 0.0005 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BALLTREE-L2 0.0005 0.0001 0.0005 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BALLTREE-EUCLIDEAN 0.0005 0.0002 0.0005 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
BALLTREE-CITYBLOCK 0.0005 0.0001 0.0005 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
NMSLIB-L1 0.0023 0.0019 0.0026 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
ANNOY-MANHATTAN 0.9944 0.0937 1.0303 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
HNSWLIB-L2 0.0010 0.0005 0.0010 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
DEEPSVDD 3.4848 3.1860 3.2811 0.9953 0.9960 0.9944 0.9908 0.9921 0.9889 1.0000 1.0000 1.0000
GMM 2.1158 0.1115 2.1114 0.9949 0.9953 0.9947 0.9899 0.9907 0.9896 1.0000 1.0000 1.0000
KDE 0.0043 0.0010 0.0044 0.9947 0.9946 0.9947 0.9896 0.9893 0.9896 1.0000 1.0000 1.0000
IFOREST 0.2578 0.1915 0.2597 0.9944 0.9946 0.9944 0.9888 0.9893 0.9888 1.0000 1.0000 1.0000
OCSVM 0.0028 0.0011 0.0028 0.9939 0.9946 0.9941 0.9879 0.9893 0.9884 1.0000 1.0000 1.0000
FB 0.0207 0.0144 0.0210 0.9944 0.9946 0.9944 0.9888 0.9893 0.9888 1.0000 1.0000 1.0000
PCA 0.0132 0.0015 0.0140 0.9951 0.9946 0.9951 0.9903 0.9893 0.9903 1.0000 1.0000 1.0000
AUTOENCODER 5.0431 6.0279 4.9394 0.9951 0.9946 0.9951 0.9903 0.9893 0.9903 1.0000 1.0000 1.0000
LOF 0.0016 0.0014 0.0016 0.9944 0.9946 0.9944 0.9888 0.9893 0.9888 1.0000 1.0000 1.0000
HBOS 0.4646 0.9127 0.4715 0.9938 0.9944 0.9947 0.9878 0.9888 0.9896 1.0000 1.0000 1.0000
MCD 151.30 3.6180 174.05 0.9939 0.9942 0.9886 0.9879 0.9885 0.9887 1.0000 1.0000 0.9898
CBLOF 0.0672 0.4408 0.0609 0.9949 0.9896 0.9957 0.9900 0.9795 0.9916 1.0000 1.0000 1.0000
HNSWLIB-COSINE 0.0011 0.0005 0.0011 0.9835 0.9785 0.9870 0.6119 0.8101 0.4654 1.0000 1.0000 1.0000
HNSWLIB-IP 0.0008 0.0004 0.0008 0.9627 0.9578 0.9866 0.4124 0.8958 0.5951 1.0000 1.0000 1.0000
QMCD 0.0379 0.5941 0.0431 0.0000 0.8946 0.0000 0.0000 0.9832 0.0000 0.0000 0.8240 0.0000
LMDD 0.3028 0.3387 0.3208 0.5641 0.2920 0.6261 1.0000 0.8237 1.0000 0.3982 0.1825 0.4628

TABLE III: Performance comparisons of our framework (FT-ANN) on the ATRDF 2023 dataset

Fig. 8: Model level ANN benchmark on the ATRDF 2023 dataset
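To make the benchmarked configuration concrete, the following minimal sketch illustrates the classification-by-retrieval idea behind the NMSLIB-COS (FT-ANN) row: FastText sentence embeddings of benign API requests indexed in NMSLIB under cosine distance. This is not the paper's released code; the corpus file name, hyperparameters, and decision threshold are illustrative assumptions, while the fasttext and nmslib calls are the libraries' standard APIs.

```python
# Minimal classification-by-retrieval sketch for the FT-ANN row above:
# FastText sentence embeddings indexed in NMSLIB with cosine similarity.
# The corpus file, hyperparameters, and threshold are illustrative assumptions.
import fasttext
import nmslib
import numpy as np

# One preprocessed benign API request per line (hypothetical file).
with open("benign_requests.txt") as f:
    benign = [line.strip() for line in f if line.strip()]

# Train an unsupervised FastText language model on the benign traffic.
ft = fasttext.train_unsupervised("benign_requests.txt", model="skipgram", dim=100)
vectors = np.array([ft.get_sentence_vector(t) for t in benign], dtype=np.float32)

# Build the approximate-nearest-neighbor index (HNSW graph, cosine distance).
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(vectors)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=False)

def is_anomalous(request: str, k: int = 3, threshold: float = 0.2) -> bool:
    """Flag a request whose nearest benign neighbors are too distant.
    NMSLIB returns cosine distance (1 - cosine similarity)."""
    _, dists = index.knnQuery(ft.get_sentence_vector(request), k=k)
    return float(np.mean(dists)) > threshold

# Example query: a SQL-injection-style request should sit far from benign traffic.
print(is_anomalous("GET /api/v1/users?id=1%20OR%201=1"))
```

Averaging the distances of the k nearest benign neighbors and thresholding is one simple scoring rule; per-class indexes would extend the same retrieval step to multi-class few-shot classification.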



ACKNOWLEDGMENT

This work was supported by the Ariel Cyber Innovation Center in conjunction with the Israel National Cyber Directorate in the Prime Minister's Office.

REFERENCES

[1] S. Balsari, A. Fortenko, J. A. Blaya, A. Gropper, M. Jayaram, R. Matthan, R. Sahasranam, M. Shankar, S. N. Sarbadhikari, B. E. Bierer et al., "Reimagining Health Data Exchange: An application programming interface–enabled roadmap for India," Journal of Medical Internet Research, vol. 20, no. 7, p. e10725, 2018.
[2] J. Ofoeda, R. Boateng, and J. Effah, "Application programming interface (API) research: A review of the past to inform the future," IJEIS, vol. 15, no. 3, pp. 76–95, 2019.
[3] A. Mendoza and G. Gu, "Mobile application web API reconnaissance: Web-to-mobile inconsistencies & vulnerabilities," in SP, 2018, pp. 756–769.
[4] C. Benzaid and T. Taleb, "ZSM security: Threat surface and best practices," IEEE Network, vol. 34, no. 3, pp. 124–133, 2020.
[5] IBM, "Innovation in the API economy: Building winning experiences and new capabilities to compete," 2016, https://www.ibm.com/downloads/cas/OXV3LYLO.
[6] M. Reddy, API Design for C++. Elsevier, 2011.
[7] R. Sun, Q. Wang, and L. Guo, "Research Towards Key Issues of API Security," in CNCERT, Beijing, China, July 20–21, 2021, pp. 179–192.
[8] M. Coyne, "Zoom's Big Security Problems Summarized," https://www.forbes.com/sites/marleycoyne/2020/04/03/zooms-big-security-problems-summarized/?sh=46fc370f4641, 2020, Forbes.
[9] "Capital One data breach: Arrest after details of 106m people stolen," https://www.bbc.com/news/world-us-canada-49159859, 2019, BBC.
[10] J. Greig, "Hilton denies hack after data from 3.7 million Honors customers offered for sale," https://therecord.media/hilton-denies-hack-after-data-from-3-7-million-honors-customer-offered-for-sale/, 2023, The Record.
[11] "Equifax Says Cyberattack May Have Affected 143 Million in the U.S." https://www.nytimes.com/2017/09/07/business/equifax-cyberattack.html, 2017, The New York Times.
[12] D. Fett, R. Küsters, and G. Schmitz, "A comprehensive formal security analysis of OAuth 2.0," in SIGSAC CCCS, 2016, pp. 1204–1215.
[13] A. Chan, A. Kharkar, R. Z. Moghaddam, Y. Mohylevskyy, A. Helyar, E. Kamal, M. Elkamhawy, and N. Sundaresan, "Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?" arXiv preprint arXiv:2306.01754, 2023.
[14] G. Baye, F. Hussain, A. Oracevic, R. Hussain, and S. A. Kazmi, "API security in large enterprises: Leveraging machine learning for anomaly detection," in ISNCC. IEEE, 2021, pp. 1–6.
[15] E. Harlicaj et al., "Anomaly detection of web-based attacks in microservices," 2021.
[16] F. Shen, Y. Mu, Y. Yang, W. Liu, L. Liu, J. Song, and H. T. Shen, "Classification by retrieval: Binarizing data and classifiers," in ACM SIGIR, 2017, pp. 595–604.
[17] W. Shi, J. Michael, S. Gururangan, and L. Zettlemoyer, "Nearest neighbor zero-shot inference," arXiv preprint arXiv:2205.13792, 2022.
[18] A. M. Qamar, E. Gaussier, J.-P. Chevallet, and J. H. Lim, "Similarity learning for nearest neighbor classification," in ICDM. IEEE, 2008, pp. 983–988.
[19] J. J. Valero-Mas, A. J. Gallego, P. Alonso-Jiménez, and X. Serra, "Multilabel prototype generation for data reduction in k-nearest neighbour classification," Pattern Recognition, vol. 135, p. 109190, 2023.
[20] A. S. Reddy and B. Rudra, "Evaluation of Recurrent Neural Networks for Detecting Injections in API Requests," in CCWC, 2021, pp. 0936–0941.
[21] J. Ombagi, "Time-Based Blind SQL Injection via HTTP Headers: Fuzzing and Exploitation," 2017.
[22] I. Jemal, M. A. Haddar, O. Cheikhrouhou, and A. Mahfoudhi, "Performance evaluation of Convolutional Neural Network for web security," Computer Communications, vol. 175, pp. 58–67, 2021.
[23] M. Gniewkowski, H. Maciejewski, T. R. Surmacz, and W. Walentynowicz, "HTTP2vec: Embedding of HTTP Requests for Detection of Anomalous Traffic," ArXiv, vol. abs/2108.01763, 2021.
[24] I. Jemal, M. A. Haddar, O. Cheikhrouhou, and A. Mahfoudhi, "M-CNN: a new hybrid deep learning model for web security," in AICCSA, 2020, pp. 1–7.
[25] Q. Niu and X. Li, "A high-performance web attack detection method based on CNN-GRU model," in ITNEC, vol. 1, 2020, pp. 804–808.
[26] L. Yu, L. Chen, J. Dong, M. Li, L. Liu, B. Zhao, and C. Zhang, "Detecting malicious web requests using an enhanced textCNN," in COMPSAC. IEEE, 2020, pp. 768–777.
[27] A. Moradi Vartouni, S. Mehralian, M. Teshnehlab, and S. Sedighian Kashi, "Auto-Encoder LSTM Methods for Anomaly-Based Web Application Firewall," International Journal of Information and Communication Technology Research, vol. 11, no. 3, pp. 49–56, 2019.
[28] A. Moradi Vartouni, M. Shokri, and M. Teshnehlab, "Auto-threshold deep SVDD for anomaly-based web application firewall," 2021.
[29] F. A. Research, "fastText library for efficient learning of word representations and sentence classification," https://github.com/facebookresearch/fastText/, 2017, online; accessed 02-December-2017.
[30] B. Bansal and S. Srivastava, "Sentiment classification of online consumer reviews using word vector representations," Procedia Computer Science, vol. 132, pp. 1147–1153, 2018.
[31] S. Toprak and A. G. Yavuz, "Web Application Firewall Based on Anomaly Detection Using Deep Learning," Acta Infologica, vol. 6, no. 2, pp. 219–244, 2022.
[32] L. Xiao, S. Matsumoto, T. Ishikawa, and K. Sakurai, "SQL Injection Attack Detection Method Using Expectation Criterion," in 2016 Fourth International Symposium on Computing and Networking (CANDAR). IEEE, 2016, pp. 649–654.
[33] Y. E. Seyyar, A. G. Yavuz, and H. M. Ünver, "An attack detection framework based on BERT and deep learning," IEEE Access, vol. 10, pp. 68633–68644, 2022.
[34] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín, "Searching in metric spaces," ACM Computing Surveys (CSUR), vol. 33, no. 3, pp. 273–321, 2001.
[35] A. Ponomarenko, N. Avrelin, B. Naidan, and L. Boytsov, "Comparative analysis of data structures for approximate nearest neighbor search," in DATA ANALYTICS, 2014, pp. 125–130.
[36] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in ACM Symposium on Theory of Computing, 1998, pp. 604–613.
[37] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, "Fast approximate nearest-neighbor search with k-nearest neighbor graph," in AI, 2011.
[38] R. Battle and E. Benson, "Bridging the semantic Web and Web 2.0 with representational state transfer (REST)," Journal of Web Semantics, vol. 6, no. 1, pp. 61–69, 2008.
[39] W. J. Buchanan, S. Helme, and A. Woodward, "Analysis of the adoption of security headers in HTTP," IET Information Security, vol. 12, no. 2, pp. 118–126, 2018.
[40] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of Tricks for Efficient Text Classification," arXiv preprint arXiv:1607.01759, 2016.
[41] B. Li and L. Han, "Distance weighted cosine similarity measure for text classification," in IDEAL, China, Oct., 2013, pp. 611–618.
[42] E. Techapanurak, M. Suganuma, and T. Okatani, "Hyperparameter-free out-of-distribution detection using cosine similarity," in ACCV, 2020.
[43] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, "Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data," in CVF, 2020, pp. 10951–10960.
[44] P. Xia, L. Zhang, and F. Li, "Learning similarity with cosine similarity ensemble," Information Sciences, vol. 307, pp. 39–52, 2015.
[45] B. Naidan, L. Boytsov, Y. Malkov, and D. Novak, "Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces," https://github.com/nmslib/nmslib, 2014, online; accessed 14-July-2014.
[46] J. E. Stone, K. S. Griffin, J. Amstutz, D. E. DeMarle, W. R. Sherman, and J. Günther, "ANARI: A 3-D Rendering API Standard," Computing in Science & Engineering, vol. 24, no. 2, pp. 7–18, 2022.
[47] C. Torrano-Gimenez, H. T. Nguyen, G. Alvarez, S. Petrović, and K. Franke, "Applying feature selection to payload-based web application firewalls," in IWSCN, 2011, pp. 75–81.
[48] B. Ware, "Analyzing Web Traffic ECML/PKDD 2007 Discovery Challenge," 2007, https://www.lirmm.fr/pkdd2007-challenge/comites.html.
[49] UNB, "A collaborative project between the Communications Security Establishment (CSE) & the Canadian Institute for Cybersecurity (CIC)," 2018, https://www.unb.ca/cic/datasets/ids-2018.html.
[50] B. Damele and M. Stampar, "SQLMAP: Automatic SQL injection and database takeover tool (2015)," http://sqlmap.org.
[51] Icesurfer and Nico, "SQLNINJA: SQL Server injection & takeover tool (2007)," https://sqlninja.sourceforge.net.
[52] S. Bennetts, "OWASP Zed attack proxy," AppSec USA, 2013.

[53] Y. Fang, J. Peng, L. Liu, and C. Huang, "WOVSQLI: Detection of SQL injection behaviors using word vector and LSTM," in CSP, 2018, pp. 170–174.
[54] C. T. Giménez, A. P. Villegas, and G. Á. Marañón, "HTTP data set CSIC 2010," CSIC, vol. 64, 2010.
[55] S. Lavian, R. Dubin, and A. Dvir, "The API Traffic Research Dataset Framework (ATRDF)," 2023, https://github.com/ArielCyber/Cisco_Ariel_Uni_API_security_challenge.
[56] P. M. S. Sánchez, J. M. J. Valero, A. H. Celdrán, G. Bovet, M. G. Pérez, and G. M. Pérez, "A survey on device behavior fingerprinting: Data sources, techniques, application scenarios, and datasets," IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1048–1077, 2021.
[57] B. R. Dawadi, B. Adhikari, and D. K. Srivastava, "Deep Learning Technique-Enabled Web Application Firewall for the Detection of Web Attacks," Sensors, vol. 23, no. 4, p. 2073, 2023.
[58] H. Mac, D. Truong, L. Nguyen, H. Nguyen, H. A. Tran, and D. Tran, "Detecting attacks on web applications using autoencoder," in ICT, 2018, pp. 416–421.
[59] J. Wang, Z. Zhou, and J. Chen, "Evaluating CNN and LSTM for web attack detection," in ICMLC, 2018, pp. 283–287.
[60] M. Ito and H. Iyatomi, "Web application firewall using character-level convolutional neural network," in CSPA, 2018, pp. 103–106.
[61] I. Jemal, M. A. Haddar, O. Cheikhrouhou, and A. Mahfoudhi, "Performance evaluation of Convolutional Neural Network for web security," Computer Communications, vol. 175, pp. 58–67, 2021.
[62] L. Yan and J. Xiong, "Web-APT-Detect: a framework for web-based advanced persistent threat detection using self-translation machine with attention," IEEE Letters of the Computer Society, vol. 3, no. 2, pp. 66–69, 2020.
[63] V. L. Pochat, T. Van Goethem, S. Tajalizadehkhoob, M. Korczyński, and W. Joosen, "Tranco: A research-oriented top sites ranking hardened against manipulation," arXiv preprint arXiv:1806.01156, 2018.
[64] Y. Zhao, Z. Nasrullah, and Z. Li, "PyOD: A python toolbox for scalable outlier detection," arXiv preprint arXiv:1901.01588, 2019.
[65] M. Aumüller, E. Bernhardsson, and A. Faithfull, "ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms," Information Systems, vol. 87, p. 101374, 2020.
[66] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[67] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[68] H. Guo, S. Yuan, and X. Wu, "LogBERT: Log anomaly detection via BERT," in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
[69] A. Arning, R. Agrawal, and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases," in KDD, vol. 1141, no. 50, 1996, pp. 972–981.
[70] T. Yu, H. Fei, and P. Li, "U-BERT for Fast and Scalable Text-Image Retrieval," in Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, 2022, pp. 193–203.
[71] J. Xin, R. Tang, Y. Yu, and J. Lin, "BERxiT: Early exiting for BERT with better fine-tuning and extension to regression," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 91–104.
[72] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, "AdapterDrop: On the efficiency of adapters in transformers," arXiv preprint arXiv:2010.11918, 2020.
[73] I. Tarunesh, S. Aditya, and M. Choudhury, "Trusting RoBERTa over BERT: Insights from checklisting the natural language inference task," arXiv preprint arXiv:2107.07229, 2021.
[74] P. Rajapaksha, R. Farahbakhsh, and N. Crespi, "BERT, XLNet or RoBERTa: the best transfer learning model to detect clickbaits," IEEE Access, vol. 9, pp. 154704–154716, 2021.
[75] Y. Jia, "Design of nearest neighbor search for dynamic interaction points," in 2021 2nd International Conference on Big Data and Informatization Education (ICBDIE). IEEE, 2021, pp. 389–393.
[76] F. Cheng, R. J. Hyndman, and A. Panagiotelis, "Manifold learning with approximate nearest neighbors," ArXiv, 2021.
[77] AWS, "Build k-Nearest Neighbor (k-NN) similarity search engine with Amazon Elasticsearch Service," 2020, https://aws.amazon.com/about-aws/whats-new/2020/03/build-k-nearest-neighbor-similarity-search-engine-with-amazon-elasticsearch-service.

Udi Aharon is currently pursuing a Ph.D. degree in the Department of Computer Science at Ariel University, Israel. His research activities span the fields of Machine Learning and Cybersecurity. Specifically, the primary focus of his work is on enhancing API security through the application of text-based models.

Ran Dubin received his B.Sc., M.Sc., and Ph.D. degrees from Ben-Gurion University, Beer Sheva, Israel, all in communication systems engineering. He is currently a faculty member at the Computer Science Department, Ariel University, Israel. His research interests revolve around zero-trust cyber protection, malware disarm and reconstruction, encrypted network traffic detection, Deep Packet Inspection (DPI), bypassing AI, Natural Language Processing, and AI trust.

Amit Dvir received his B.Sc., M.Sc., and Ph.D. degrees from Ben-Gurion University, Beer Sheva, Israel, all in communication systems engineering. He is currently a Faculty Member in the Department of Computer Science and the head of the Ariel Cyber Innovation Center, Ariel University, Israel. From 2011 to 2012, he was a Postdoctoral Fellow at the Laboratory of Cryptography and System Security, Budapest, Hungary. His research interests include data enrichment from encrypted traffic.

Chen Hajaj holds Ph.D. (Computer Science), M.Sc. (Electrical Engineering), and B.Sc. (Computer Engineering) degrees, all from Bar-Ilan University. He is a faculty member in the Department of Industrial Engineering and Management, the head of the Data Science and Artificial Intelligence Research Center, and a member of the Ariel Cyber Innovation Center. From 2016 to 2018, Chen was a postdoctoral fellow at Vanderbilt University. Chen's research activities are in the areas of Machine Learning, Game Theory, and Cybersecurity. Specifically, his work focuses on encrypted traffic classification, detecting and hardening the weak spots of AI methods (adversarial artificial intelligence), and multimodal classification techniques.
