Few-Shot API Attack Anomaly Detection in a Classification-by-Retrieval Framework
Abstract—Application Programming Interface (API) attacks refer to the unauthorized or malicious use of APIs, which are often exploited to gain access to sensitive data or manipulate on-

adoption of web APIs has heightened the potential of user safety and privacy breaches, making APIs a prime target for cyber attackers [3]. In recent years, there have been several
arXiv:2405.11247v1 [cs.CR] 18 May 2024
models to compute word representations. Although CBOW learns faster than Skip-gram, Skip-gram outperforms CBOW on small datasets [30]. FastText adopts a character n-gram approach to tokenize words, which effectively tackles Out-of-vocabulary (OOV) problems. This method generates embeddings not only for common words but also for rare, misspelled, or previously unseen words in the training corpus. In the realm of API security, it is imperative to pay attention to the internal structure of words in order to grasp the intent and context behind API requests. API security entails scrutinizing the textual content within API requests to pinpoint potential threats or vulnerabilities [31]. These security threats could be concealed within apparently harmless text, which may involve the use of uncommon or previously unseen terms, or even attempts to obscure their intent through spelling errors [32]. Moreover, APIs might be required to accommodate a diverse range of languages, including those characterized by intricate morphological structures [33].

Prior works have proposed various methods for detecting anomalies. While the majority of these approaches focus on classic classification tasks, we propose the utilization of the ANN vector similarity method for identifying in-distribution records. The similarity search concept pertains to the process of identifying data points in a dataset that demonstrate similarities with a specific pattern, commonly referred to as a query [34]. To measure the similarity of a pair of data points, a distance function is used, where a small distance indicates that the two points are more similar or “closer” to each other [35]. The NN search is a specific type of similarity search used to identify data points that are nearest in distance to a provided query point [36]. The ANN allows search despite the possibility of not retrieving all neighbors in a metric space [36].

While both ANN and traditional NN are rooted in the fundamental concept of similarity search, traditional NN search involves an exhaustive examination of all data points to find the NN, which can be prohibitively time-consuming for large datasets [36]. In contrast, ANN employs techniques that trade off a slight loss in precision for substantial gains in speed, enabling the identification of approximate nearest neighbors without examining every data point. This efficiency becomes especially crucial in real-world API applications where rapid response times are essential. Additionally, ANN methods are adaptable to high-dimensional spaces [34], [37], where traditional NN searches can suffer from the “curse of dimensionality”, in which the computational requirements for exact NN search become prohibitively high as the dimensionality of the dataset increases [36].

The framework proposed (depicted in Figure 1) comprises a combination of a FastText embedding network and a retrieval layer, which includes an ANN matching component and a result aggregation component built upon it, forming the FT-ANN system. In Phase 1 (Figure 1), we initiate the process by training a generic language model to serve as a reliable baseline for the detection model. To gather a substantial dataset, we conducted web crawling on a random sample of websites from the Tranco top websites list, collecting over a million examples of normal web traffic. This language model now serves as a pre-trained baseline, applicable to any real-world API anomaly detection system.

Proceeding to Phase 2 (Figure 1), the unsupervised detection model is trained, consisting of a pre-processing step and a Classification-by-retrieval framework. The pre-processing step validates the HTTP headers against the standard structure. Additionally, a unique data transformation is applied to simplify the API vocabulary. The detection model is trained solely on normal API traffic, and constructs a single ANN model for all endpoints. The term “endpoint” refers to any data or metadata that may represent a type, an origin, or identification of a respective API request. An endpoint is defined as the combination of the method, host and path [38]. For each endpoint, a threshold is calculated and utilized during the detection stage.

Finally, in Phase 3 (Figure 1), the model’s performance is validated by allowing an index search for every request. During validation, we standardize the request using the same pre-processing stages applied during the detection model training phase. The text is then transformed into a vector representation using our generic language model through inference. These embedding vectors are used in the ANN search, which returns the K-NNs from the specific endpoint index collection, followed by a maximum distance scaling layer. The final score obtained enables the model to ascertain whether the incoming API request is normal or an anomaly by comparing with the
pre-defined threshold for each API endpoint.

A. Data Pre-Processing
For both training and inferring the detection model, we employ a unique pre-processing technique consisting of three phases, as depicted in Figure 1 in steps 2.1 and 3.1. First, we decode URL special characters, decompress the request body content, and convert every character within the request data string into lowercase. Then, we validate request headers to ensure they are formatted correctly and extract the endpoint definition, as depicted in Figure 1, steps 2.1.1 and 3.1.1. API request headers typically provide information about the request context, supply authentication credentials, and provide information about the client (e.g., a person, a computing device, and/or a browser application) that had initiated the API request [39]. API request header fields are typically derived from a limited set of options. Accordingly, the data preparation also uses a fixed set of rules to validate the content of request headers and filter out headers according to these rules. Headers that include valid or approved strings may be transferred to their destination as an API request and may be excluded from additional processing. For example, request headers may include host strings, which specify host or Internet Protocol (IP) addresses and/or port numbers of a server to which the API request is being sent. Valid IPv4 syntax should be in the format of: (0 ≤ n < 256).(0 ≤ n < 256).(0 ≤ n < 256).(0 ≤ n < 256):(1 ≤ n ≤ 65535).

Lastly, as shown in Figure 1, steps 2.1.2 and 3.1.2, we convert received requests into abstracted versions based on a conversion schema. For instance, we replace non-numeric single characters with the string “chr”, which serves as a representative, abstract version of the original request string. Another example involves converting non-textual symbols into predefined textual strings. For instance, colons (“:”) may be converted to the string “colon”. During the pre-processing stage, the API language is refined, enabling it to describe various transactions in a consistent format without sacrificing their original meanings. In fact, in the majority of cases, these requests have been merged into a single text representation. This consolidation not only streamlines the data but also ensures that the essential information pertaining to different transactions remains preserved.

Fig. 2: Data pre-processing steps

B. Unsupervised ML Language Model
Our framework leverages FastText for unsupervised learning. FastText possesses the capability to encapsulate substantive knowledge about words while integrating morphology details, a crucial aspect for API attack detection. While deep learning models have achieved state-of-the-art results across
various NLP tasks, to the best of our knowledge, no previous NLP pre-trained model on the API traffic domain has been publicly published. In the training phase of the method, we built a single generic FastText language model (based on [29]) from scratch using the normal API traffic collected by crawling Tranco’s list of the most popular websites, as can be seen in Figure 1 Phase 1. For training the generic language model, we used the default hyper-parameters of [29], which encompass a learning rate of 0.05 (5%), a word vector size of 100, and a context window size of 5. The model was then trained for 5 epochs. We utilized the CBOW model as the dataset is relatively large, and CBOW embeddings are precise enough for anomaly detection and computed in a shorter time than skip-gram [40]. For training the detection model and inferencing, we extract the vector representation of words for every input line.

C. ANN
We use ANN to identify the normal representation of an API endpoint. We train a single detection model, which is used to describe a normal representation of all endpoints. During the detection model training stage, each API request is represented as a vector in the textual embedding space, including endpoint information. We employ cosine distance to measure the similarity between data points as it has been applied in numerous text mining endeavors, such as text classification and information retrieval [41]. Additionally, it has been proven to be effective for Out-of-Distribution (OOD) detection tasks [42], [43]. Cosine similarity is a widely used measure of similarity that calculates the cosine of the angle formed by a pair of vectors. When measuring the similarity between two patterns, the Euclidean distance increases as they become less similar, while the cosine similarity increases as they become more similar. Unlike Euclidean distance, cosine similarity is unaffected by the magnitude of the vectors being compared [44]. The embedding vector feeds the Hierarchical Navigable Small Worlds (HNSW) graph [45] to build new indexes of data points. During the detection stage, the model compares new API vectors of incoming API requests with API vectors of the same API endpoint to evaluate normality or anomaly of the incoming API request. The comparison is made by querying the similarity between the input vector and the closest k objects.

The ANN search returns a set of IDs representing neighbors’ points and the similarity score between the given point and each ID. Max distance scaling is employed to scale the ANN similarity score within the given range. For every score, the maximum value gets transformed into a 0, and every other value is divided by the maximum similarity score in the range and then subtracted from 1. We use this method to invert the relationship between the original and the normalized scores to emphasize higher scores for smaller values. In the cosine space, a smaller distance indicates that the two points are closer to each other. Let $X$ be the similarity score; the normalized score $X'$ is:

$$X' = 1 - \frac{X}{\max(X)}$$

Lastly, we suggest an adaptive search for the best threshold for each API endpoint. As part of the detection model training, the model iteratively evaluates thresholds between 0 and 1, with increments of 0.1. The model considers the balance between precision and recall, as captured by the F1 score, to determine the optimal threshold value. As described in Figure 3, in the detection stage, the first position score with the max normalized value is compared to the best threshold to determine anomaly.

Fig. 3: Features extraction and classification-by-retrieval

V. EXPERIMENTAL DESIGN
API security papers are not as prevalent as the technology itself, despite being one of the most influential technologies [46]. This is particularly evident in the scarcity of ready-to-use publicly available API datasets [47]. Several of these datasets are obsolete and unsuitable for usage, with some lacking traffic diversity and volumes, and others failing to encompass a variety of attacks, such as ECML/PKDD 2007 [48] and CSE-CIC-IDS 2018 [49]. Consequently, researchers resort to creating customized datasets by primarily utilizing open-source vulnerable web applications like DVWA, BWAPP, and Mutillidae, and employ automated penetration tools such as SQLMAP [50], SQLNINJA [51], and OWASP ZAP [52] to gather malicious payloads [53]. Therefore, this research is evaluated on two datasets: CSIC 2010 [54] and ATRDF 2023 [55]. The HTTP CSIC 2010 dataset [54] is widely used [23]–[27], [33], [56]–[62] in the field of malicious web traffic detection. This dataset was created by the Spanish Research National Council (CSIC). It is a sample of the traffic occurring on a Spanish e-commerce web application. The dataset includes attacks such as SQL injection (Figure 4a), buffer overflow, information gathering, files disclosure, CRLF Injection, XSS, static attacks and unintentional illegal requests. While unintentional illegal requests lack malicious intent, they deviate from the typical behavior of the web application and exhibit a different structure compared to regular parameter values. For instance, as shown in Figure 4b, an invalid DNI (Spanish national ID number) was marked as an anomaly. We divided the dataset into two segments: the training portion, which included 36,000 normal requests and was exclusively utilized for representation learning, and the inference portion, which comprised both 36,000 normal and 25,000 anomalous requests that were encoded by the model and employed for detection. It has 38 different endpoints, 8 of which have no normal representation and were excluded from our experiment. The API Traffic Research Dataset Framework (ATRDF) [55] is a recently published HTTP dataset publicly available which
shows no overlap and clear separation between the classes while for CSIC 2010, some records from the two classes were found to be similar. We identified that most of the anomalous requests which overlap with normal requests actually have no malicious payload and are categorized as unintentional illegal requests.

(a) CSIC 2010
(b) ATRDF 2023
Fig. 5: Vector representations reduced to 2D with t-SNE

Then, we compare our method with fourteen baseline models for detecting outlying objects in multivariate data [64]: Feature Bagging (FB), Histogram-based Outlier Detection (HBOS), Isolation Forest (IF), Local Outlier Factor (LOF), Minimum Covariance Determinant (MCD), One-class SVM (OCSVM), Principal Component Analysis (PCA), Copula-Based Outlier Detector (COPOD), Deep One-Class Classification for outlier detection (DeepSVDD), Clustering-Based Local Outlier Factor (CBLOF), Outlier detection based on Gaussian Mixture Model (GMM), Kernel Density Estimation (KDE), Linear Model Deviation-based Outlier Detection (LMDD), Quasi-Monte Carlo Discrepancy Outlier Detection (QMCD). We apply default hyper-parameters as provided by the original source code across all models for consistency and use the same pre-processed dataset to facilitate optimal comparison.

Furthermore, we conducted an ANN benchmark to effectively contrast our framework with other nearest neighbor algorithms. This benchmark involved generating algorithm instances based on a configuration file written in YAML format that defines the different methods and algorithms. At the top level, the point type is specified, followed by the distance metric, and finally, each algorithm implementation to be evaluated. Each implementation specifies the Python library, and additional entries provide the necessary arguments. For clarity, an illustrative example of this configuration file is presented in Figure 6, while the complete file is available in the project GitHub repository³. Both the “space” and “run_groups” lists encompass arguments that should be included at the beginning of every invocation. Each algorithm defines one or more “run_groups,” each of which is expanded into several lists of arguments. The Cartesian product of these entries results in numerous argument lists. For instance, consider the hnswlib entry depicted in Figure 6. This expands into three distinct algorithm instances: Cosine (Cosine Similarity), L2 (Squared L2), and IP (Inner product). Each of these instances undergoes training before being utilized for various experiments. Initially, experiments are conducted with different values of k, representing the number of neighbors to return (e.g., [10, 50, 100, 300, 400, 500, 1000, 2000, 2500, 3000]). Subsequently, experiments are conducted with varying ef_construction values, which denote the size of the dynamic list used during index construction. A larger ef_construction value indicates a higher quality index but also results in longer build times (e.g., [10, 20, 40, 80, 120, 200, 400, 600, 800]). Throughout each run, pertinent information is recorded, including the algorithm name, the time taken to construct the data structure used for indexing, and the outcomes of every query. These query outcomes encompass the neighboring points returned by the algorithm, the duration required to locate these neighbors, and the proximity between the neighbors and the query point.

We leveraged the query results to compute the corresponding confusion matrix, enabling us to thoroughly evaluate the classification-by-retrieval performance of each algorithm. We conducted our benchmarking analysis by assessing the performance of various algorithm implementations, including Nmslib, Hnswlib, Bruteforce Blas, Balltree, KDtree, CKDtree, Annoy, Faiss, and RPForest, all of which were evaluated using the publicly accessible ANN-Benchmarks tool as a reference framework [65]. In order to gain a deeper comprehension of how the embedding layer influences the detection outcome, we assessed all anomaly detection baseline models and the ANN benchmark using two additional prominent language models: BERT [66] and RoBERTa [67]. Each model was individually trained from scratch and subsequently subjected to evaluation. We employed the RoBERTa model, specifically the RoBERTaForMaskedLM class, with a language modeling head on top⁴. This model was trained with a maximum sequence length of 512, utilizing 12 hidden layers and 12 attention heads. The training process spanned 10 epochs, employing a batch size of 16, which aligns with a similar approach in a prior study by [23]. Likewise, we employed the BERT model, specifically BertForMaskedLM, which also incorporates a language modeling head on top⁵. This model underwent training with a maximum sequence length of 512, utilizing 4 hidden layers and 4 attention heads, as was the case in a similar task outlined in [68].

³ https://github.com/ArielCyber/FT-ANN-Journal/blob/main/ann_benchmark.yaml
⁴ RoBERTa implementation by Hugging Face - https://huggingface.co/docs/transformers/model_doc/RoBERTa
⁵ BERT implementation by Hugging Face - https://huggingface.co/docs/transformers/model_doc/bert

- name: hnswlib
  library: hnswlib
  method: [hnswlib]
  space: [cosine, l2, ip]
  run_groups:
    K:
      query_args: [[10, 50, 100, 300, 400, 500, 1000, 2000, 2500, 3000]]
    ef_construction:
      query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
Fig. 6: Example configuration for the hnswlib algorithm

We used an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz to evaluate the effectiveness of each technique. Each endpoint was evaluated separately, and the average score for all endpoints was used to measure overall detector performance. We evaluate model performance by measuring both execution time and F1-score during training and testing. Training time is usually longer than testing time as it entails parameter optimization, a computationally intensive task. Conversely, testing time is relatively faster as it only involves applying the pre-trained model to new data and predicting their likelihood of being outliers. It is worth noting that detecting anomalies in real-time is critical in mitigating the effects of an attack. Therefore, it is advisable to analyze each phase separately to better understand each model’s performance. Additionally, we aim to achieve better results compared to previous semi-supervised and unsupervised studies. Supervised studies should not be compared since they rely on labeled data.

In the context of ANN search, selecting the optimal value for K involves determining the number of nearest neighbors to consider for predictions. The model performance evaluation and the choice of the optimized value of K rely on the F1 score, which provides a balanced assessment of precision and recall. By systematically iterating through a range of K values and assessing the F1 score for each value, we can identify the K value that yields the highest F1 score and use it in our final endpoint evaluation. During this experiment, the performance of the model was validated by iteratively testing different values of K, ranging from 1 to 1000.

Table II shows the results for the CSIC 2010 dataset. Compared against the conventional outlier detection baselines, NMSLIB-COS (FT-ANN) and PCA demonstrate the shortest training times, with 0.0552 and 0.0575 seconds respectively, whereas AUTOENCODER and LMDD show significantly longer training times of 75.0644 and 225.8 seconds respectively. NMSLIB-COS achieves the highest F1 score of 0.9713, indicating its balanced precision-recall performance. Similarly, in terms of precision and recall, NMSLIB-COS again outperforms other algorithms with scores of 0.9538 and 0.9954 respectively. LMDD appears to exhibit the comparatively weakest performance across multiple metrics. With an F1 score of 0.3017, a precision of 0.528, and a recall of 0.2293, LMDD lags behind the other algorithms. Furthermore, its training time of 225.8 seconds is considerably longer than most other methods. The poor performance and unfavorable tradeoff between training time and results for LMDD could be attributed to its reliance on a dissimilarity function that may not be well-suited to the complex and diverse anomalies present [69].

Several observations stand out when considering the framework performance using BERT and RoBERTa embeddings compared to FastText. BERT and RoBERTa introduce longer training times for all models. For instance, NMSLIB-COS with BERT takes around 0.446 seconds, while with RoBERTa it takes approximately 0.563 seconds, compared to the original 0.0552 seconds with FastText. This increase in training time could be attributed to the higher computational complexity of the BERT [70], [71] and RoBERTa [72] models. In terms of performance, when considering the BERT embedding, the NMSLIB-COS framework continues to exhibit robust performance, achieving an F1-score of 0.9675. This result suggests that the framework effectively leverages the contextual information embedded within BERT’s representations to identify anomalies. The high precision (0.947) and recall (0.9982) values further support the framework’s ability to maintain a fine balance between detecting true anomalies and minimizing false positives. However, a noteworthy observation lies in the performance of the NMSLIB-COS framework when using RoBERTa embeddings. Surprisingly, while RoBERTa is considered an even more advanced and powerful language model compared to BERT, the F1-score for NMSLIB-COS drops slightly to 0.9664. This outcome raises questions about why the transition to RoBERTa, which typically exhibits superior performance across a range of natural language processing tasks [67], [73], [74], did not lead to an improved performance for this specific outlier detection method. Generally, both BERT and RoBERTa maintain performance comparable with FastText, with some exceptions. NMSLIB-COS with BERT achieves an F1-score of 0.9675, slightly below FastText’s 0.9713. GMM, MCD, CBLOF, and IFOREST exhibit consistent or improved F1-scores with BERT and RoBERTa. However, for DEEPSVDD, the F1-scores drop slightly with BERT to 0.9187 and even more with RoBERTa to 0.8999. Precision and recall also showcase similar trends.

The ANN baseline models generally perform better than the traditional outlier detection methods, and their performance is quite similar with only slight differences. Among these models, HNSWLIB-IP stands out with the highest F1 score (0.9765) and perfect recall (100%), which means it doesn’t miss any actual anomalies. When we balance training time, precision, and recall, the BRUTEFORCE-BLAS models achieve the best precision and recall while needing less training time. This makes them suitable for scenarios where quick responses or limited resources are important.
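The K selection procedure described in this section (iterate over candidate K values, score each by F1, keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: `f1_score`, `select_best_k`, and the `predict_with_k` callback are hypothetical names, anomalies are assumed to carry label 1, and the toy labels are invented for the example.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive class (assumed here: anomaly = 1); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_k(
    k_values: Iterable[int],
    y_true: List[int],
    predict_with_k: Callable[[int], List[int]],
) -> Tuple[Optional[int], float]:
    """Return the K with the highest F1 and that F1 value."""
    best_k, best_f1 = None, -1.0
    for k in k_values:
        score = f1_score(y_true, predict_with_k(k))
        if score > best_f1:  # strict '>' keeps the first (smallest) K on ties
            best_k, best_f1 = k, score
    return best_k, best_f1


# Toy data: predictions that a detector might return for three K values.
y_true = [0, 0, 1, 1, 1]
toy_predictions = {
    1: [0, 1, 1, 0, 0],
    5: [0, 0, 1, 1, 1],
    10: [0, 0, 1, 1, 0],
}
best_k, best_f1 = select_best_k(sorted(toy_predictions), y_true, toy_predictions.__getitem__)
print(best_k, best_f1)  # prints: 5 1.0
```

In a real run, `predict_with_k` would query the ANN index with the given K and apply the per-endpoint threshold; here it is only a lookup into fixed toy predictions.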
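The expansion of a benchmark configuration entry into individual runs can be illustrated with a short sketch. The dictionary below copies the values of the hnswlib entry in Fig. 6; `expand_runs` is a hypothetical helper written for this illustration and is a simplification, not the loader used by the ANN-Benchmarks tool. It assumes one algorithm instance per distance space and, within each run group, a Cartesian product over the lists in `query_args`.

```python
from itertools import product

# Values copied from the hnswlib entry in Fig. 6.
HNSWLIB_ENTRY = {
    "name": "hnswlib",
    "space": ["cosine", "l2", "ip"],
    "run_groups": {
        "K": {"query_args": [[10, 50, 100, 300, 400, 500, 1000, 2000, 2500, 3000]]},
        "ef_construction": {"query_args": [[10, 20, 40, 80, 120, 200, 400, 600, 800]]},
    },
}


def expand_runs(entry):
    """Return one (space, run_group, args) tuple per experiment to execute."""
    runs = []
    for space in entry["space"]:  # one algorithm instance per distance space
        for group_name, group in entry["run_groups"].items():
            # Cartesian product over the argument lists of this run group.
            for args in product(*group["query_args"]):
                runs.append((space, group_name, args))
    return runs


runs = expand_runs(HNSWLIB_ENTRY)
print(len(runs))  # 3 spaces * (10 K values + 9 ef_construction values) = 57
print(runs[0])    # ('cosine', 'K', (10,))
```

With a single list per run group the product reduces to one run per value; a run group holding two lists would yield every pairing of their values.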
On the other hand, models like HNSWLIB-IP and ANNOY-MANHATTAN have slightly longer training times but achieve slightly higher F1 scores. The relatively slower performance of the MINKOWSKI distance metric in the BRUTEFORCE-BLAS model might be due to the more complicated calculations involved in the MINKOWSKI distance, especially when dealing with high-dimensional data [75]. The ANNOY model takes a different approach to constructing its search structure, which might explain its slightly slower training times compared to other models. ANNOY’s method involves generating random projections and building binary trees for its search index, which requires significant computation to ensure effective projection dimensions that preserve data relationships [76]. Despite not having the highest F1 score, NMSLIB finds use in industry, for example in Amazon Elasticsearch Service [77]. This indicates that although its accuracy might not be the absolute best compared to other ANN implementations, its simple deployment and easy integration align well with the requirements of real-world applications. The overall performance of the algorithm can be evaluated through the average outcomes of different distance metrics and the algorithm parameters. As depicted in Figure 7, each algorithm utilizing FastText embeddings exhibited significantly swifter processing times compared to BERT and RoBERTa. FastText and CKDTREE attained the shortest average index build times (0.00584 seconds), whereas BALLTREE operated over 2.5 times slower despite employing the same embeddings. Within the top 8 fastest FastText algorithms, only BRUTEFORCE demonstrated favorable performance when applied to BERT and RoBERTa, achieving scores of 0.02745 and 0.03215, respectively. Conversely, ANNOY’s performance was subpar for each language model. Regarding model accuracy, as illustrated in Figure 7, every algorithm displayed enhanced results when utilizing FastText, achieving F1-scores exceeding 97%. Similarly, the remaining algorithms also exhibited strong performance with F1-scores surpassing 96%. Generally, our framework achieved better accuracy compared to previous unsupervised and semi-supervised studies using the CSIC 2010 dataset, as shown in Table I.

The results for the ATRDF 2023 dataset, shown in Table III, indicate a perfect classifier for most of the ANN algorithms. With 100% F1-score, recall and precision, our framework exceeded the traditional outlier detection baselines. Most of the models show short training/building times, whereas AUTOENCODER, MCD and DEEPSVDD show significantly longer training times, averaging 5.3368, 109.6675 and 3.3173 seconds respectively. While NMSLIB-COS did not attain the briefest build time, it can still be regarded as relatively rapid. It is evident that BERT and RoBERTa result in extended training durations for the majority of models, with a few exceptions. LMDD, CBLOF, QMCD, HBOS, and AUTOENCODER exhibited comparatively swifter training completion when utilized with BERT and RoBERTa. Most of the traditional outlier detection baselines also performed well but struggled to predict all positive classes, which affected the precision score. LMDD obtained an average F1-score of 49.41%, which is much worse than the other models tested. QMCD failed to predict when tested with BERT and RoBERTa. The comprehensive mean benchmark outcomes of the diverse distance metrics and algorithm parameters are presented in Figure 8. FastText and CKDTREE managed to achieve the briefest average index construction times (0.00007 seconds), whereas BALLTREE functioned more than twice as slowly even though it utilized the same embeddings. Generally, all the algorithms exhibited rapid performance, with the exception of ANNOY across all language models. As previously mentioned, all ANN algorithms demonstrated outstanding predictive accuracy.

VIII. DISCUSSION AND CONCLUSIONS
This paper suggests an innovative unsupervised few-shot anomaly detection framework that leverages a dedicated generic language model for API based on FastText embedding and uses ANN search in a Classification-by-retrieval approach. We showed that API attacks could be easily identified with no previous learning. To the best of our knowledge, this is the first work to utilize a Classification-by-retrieval framework based on the generalized approach of FastText embeddings combined with the approximate search to find anomalies in API traffic. We present a unique pre-processing technique to enhance input generalization and simplify API structure. This approach encompasses dividing input data into individual tokens, then constraining the vocabulary to a limited set of tokens. Consequently, the API structure becomes streamlined as the number of unique tokens diminishes, which enables input generalization and high detection accuracy even with minimal examples per class. We presented several state-of-the-art models for this task, performed a comparative analysis and demonstrated the best accuracy on the CSIC 2010 and ATRDF 2023 datasets. We also evaluated these models with two other widely adopted language models, BERT and RoBERTa. Our comprehensive analysis encompassed benchmarking various ANN search algorithms, where we illustrated our models’ exceptional accuracy on the CSIC 2010 and ATRDF 2023 datasets. While we noted that our proposed dense cosine distance approach utilizing NMSLIB did not yield the top F1 score among ANN algorithms, its straightforward implementation and seamless integration make it a suitable choice for practical applications, particularly those like Amazon Elasticsearch Service that prioritize real-world compatibility. One notable limitation of this and similar studies arises from the scarcity of up-to-date and representative network traffic datasets. Many research efforts rely solely on the CSIC 2010 dataset, raising concerns about the reliability and applicability of such data, consequently impacting the quality and generalizability of traffic models. While our evaluation focuses on measuring the impact of FT-ANN on the HTTP dataset, it is important to acknowledge that APIs exist in diverse forms, including REST, GraphQL, gRPC, and WebSockets, each catering to specific use cases and carrying common vulnerabilities. Our future work entails expanding the evaluation to encompass a wider range of API types and to assess the performance across various loads and volumes.
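The pre-processing idea summarized in the conclusions (normalize the request, validate the host header, then map symbols and single characters onto a small fixed vocabulary) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names and the `SYMBOL_WORDS` schema are hypothetical, only the ":" to "colon" and single-character to "chr" rules come from the paper, and request-body decompression is omitted.

```python
import re
from urllib.parse import unquote

# Hypothetical conversion schema; only the ":" -> "colon" pair is taken from the paper.
SYMBOL_WORDS = {":": "colon", "/": "slash", "?": "qmark", "=": "equals", "&": "amp"}

# a.b.c.d:port, numeric ranges are checked separately below.
HOST_RE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})")


def is_valid_ipv4_host(host: str) -> bool:
    """Check the rule 0 <= octet < 256 for each octet and 1 <= port <= 65535."""
    match = HOST_RE.fullmatch(host)
    if not match:
        return False
    *octets, port = (int(group) for group in match.groups())
    return all(0 <= octet < 256 for octet in octets) and 1 <= port <= 65535


def abstract_request(raw: str) -> str:
    """Decode, lowercase, and abstract a request line into a reduced vocabulary."""
    text = unquote(raw).lower()  # decode URL special characters, then lowercase
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {word} ")  # symbols become separate word tokens
    tokens = []
    for token in text.split():
        # Non-numeric single characters are replaced by the placeholder "chr".
        tokens.append("chr" if len(token) == 1 and not token.isdigit() else token)
    return " ".join(tokens)


print(is_valid_ipv4_host("192.168.0.1:8080"))  # True
print(is_valid_ipv4_host("256.1.1.1:80"))      # False
print(abstract_request("GET /item?id=A%3A7"))
# get slash item qmark id equals chr colon 7
```

The effect is the one described in the text: many distinct raw requests collapse onto the same short token sequence, which shrinks the vocabulary the embedding model has to cover.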
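The scoring steps of the classification-by-retrieval layer (cosine distance, max distance scaling with X' = 1 - X / max(X), and the per-endpoint threshold search between 0 and 1 in steps of 0.1) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: all function names are hypothetical, and the decision direction (a request is flagged as an anomaly when its score falls below the threshold) is an assumption made for the example, since the paper only states that the score is compared to the threshold.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; depends on the angle only, not on vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def max_distance_scaling(distances: List[float]) -> List[float]:
    """X' = 1 - X / max(X): the largest distance maps to 0, smaller ones score higher."""
    largest = max(distances)
    if largest == 0:
        return [1.0 for _ in distances]  # every neighbor coincides with the query
    return [1.0 - d / largest for d in distances]


def f1(labels: List[int], preds: List[int]) -> float:
    """F1 for the anomaly class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(top_scores: List[float], labels: List[int]) -> float:
    """Evaluate thresholds 0.0, 0.1, ..., 1.0 and keep the one with the highest F1.

    Assumption: a score below the threshold marks the request as an anomaly.
    """
    best_t, best_f1 = 0.0, -1.0
    for step in range(11):
        t = step / 10
        preds = [1 if s < t else 0 for s in top_scores]
        score = f1(labels, preds)
        if score > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, score
    return best_t


# Magnitude invariance: parallel vectors of different length have distance 0.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
# Toy validation scores for one endpoint (0 = normal, 1 = anomaly).
print(best_threshold([0.9, 0.8, 0.85, 0.3, 0.2], [0, 0, 0, 1, 1]))  # 0.4
```

Keeping one threshold per endpoint, as the paper proposes, would amount to calling `best_threshold` once per endpoint on that endpoint's validation scores.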
TABLE II: Performance comparisons of our framework (FT-ANN) on the CSIC 2010 dataset
TABLE III: Performance comparisons of our framework (FT-ANN) on the ATRDF 2023 dataset
ACKNOWLEDGMENT
This work was supported by the Ariel Cyber Innovation Center in conjunction with the Israel National Cyber Directorate in the Prime Minister’s Office.

REFERENCES
[1] S. Balsari, A. Fortenko, J. A. Blaya, A. Gropper, M. Jayaram, R. Matthan, R. Sahasranam, M. Shankar, S. N. Sarbadhikari, B. E. Bierer et al., “Reimagining Health Data Exchange: An application programming interface–enabled roadmap for India,” Journal of Medical Internet Research, vol. 20, no. 7, p. e10725, 2018.
[2] J. Ofoeda, R. Boateng, and J. Effah, “Application programming interface (API) research: A review of the past to inform the future,” IJEIS, vol. 15, no. 3, pp. 76–95, 2019.
[3] A. Mendoza and G. Gu, “Mobile application web API reconnaissance: Web-to-mobile inconsistencies & vulnerabilities,” in SP, 2018, pp. 756–769.
[4] C. Benzaid and T. Taleb, “ZSM security: Threat surface and best practices,” IEEE Network, vol. 34, no. 3, pp. 124–133, 2020.
[5] IBM, “Innovation in the API economy: Building winning experi-
[25] Q. Niu and X. Li, “A high-performance web attack detection method based on CNN-GRU model,” in ITNEC, vol. 1, 2020, pp. 804–808.
[26] L. Yu, L. Chen, J. Dong, M. Li, L. Liu, B. Zhao, and C. Zhang, “Detecting malicious web requests using an enhanced textCNN,” in COMPSAC. IEEE, 2020, pp. 768–777.
[27] A. Moradi Vartouni, S. Mehralian, M. Teshnehlab, and S. Sedighian Kashi, “Auto-Encoder LSTM Methods for Anomaly-Based Web Application Firewall,” International Journal of Information and Communication Technology Research, vol. 11, no. 3, pp. 49–56, 2019.
[28] A. Moradi Vartouni, M. Shokri, and M. Teshnehlab, “Auto-threshold deep SVDD for anomaly-based web application firewall,” 2021.
[29] F. A. Research, “fastText library for efficient learning of word representations and sentence classification,” https://github.com/facebookresearch/fastText/, 2017, online; accessed 02-December-2017.
[30] B. Bansal and S. Srivastava, “Sentiment classification of online consumer reviews using word vector representations,” Procedia Computer Science, vol. 132, pp. 1147–1153, 2018.
[31] S. TOPRAK and A. G. YAVUZ, “Web Application Firewall Based on Anomaly Detection Using Deep Learning,” Acta Infologica, vol. 6, no. 2, pp. 219–244, 2022.
[32] L. Xiao, S. Matsumoto, T. Ishikawa, and K. Sakurai, “SQL Injection Attack Detection Method Using Expectation Criterion,” in 2016 Fourth International Symposium on Computing and Networking (CANDAR).
ences and new capabilities to compete,” 2016, https://www.ibm.com/ IEEE, 2016, pp. 649–654.
downloads/cas/OXV3LYLO. [33] Y. E. Seyyar, A. G. Yavuz, and H. M. Ünver, “An attack detection
[6] M. Reddy, API Design for C++. Elsevier, 2011. framework based on BERT and deep learning,” IEEE Access, vol. 10,
[7] R. Sun, Q. Wang, and L. Guo, “Research Towards Key Issues of API pp. 68 633–68 644, 2022.
Security,” in CNCERT, Beijing, China, July 20–21, 2021, pp. 179–192. [34] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquı́n, “Searching
[8] M. Coyne, “Zoom’s Big Security Problems Summa- in metric spaces,” ACM Computing Surveys (CSUR), vol. 33, no. 3, pp.
rized,” https://www.forbes.com/sites/marleycoyne/2020/04/03/ 273–321, 2001.
zooms-big-security-problems-summarized/?sh=46fc370f4641, 2020, [35] A. Ponomarenko, N. Avrelin, B. Naidan, and L. Boytsov, “Comparative
forbes. analysis of data structures for approximate nearest neighbor search,” in
[9] “Capital One data breach: Arrest after details of 106m people stolen,” DATA ANALYTICS, 2014, pp. 125–130.
https://www.bbc.com/news/world-us-canada-49159859, 2019, bBC.
[36] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards
[10] J. Greig, “Hilton denies hack after data from 3.7 million Honors
removing the curse of dimensionality,” in ACM Symposium on Theory
customers offered for sale,” https://therecord.media/hilton-denies-hack-
of Computing, 1998, pp. 604–613.
after-data-from-3-7-million-honors-customer-offered-for-sale/, 2023, the
[37] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, “Fast
Record.
approximate nearest-neighbor search with k-nearest neighbor graph,” in
[11] “Equifax Says Cyberattack May Have Affected 143 Million
AI, 2011.
in the U.S.” https://www.nytimes.com/2017/09/07/business/equifax-
[38] R. Battle and E. Benson, “Bridging the semantic Web and Web 2.0
cyberattack.html, 2017, the New York Times.
with representational state transfer (REST),” Journal of Web Semantics,
[12] D. Fett, R. Küsters, and G. Schmitz, “A comprehensive formal security
vol. 6, no. 1, pp. 61–69, 2008.
analysis of OAuth 2.0,” in SIGSAC CCCS, 2016, pp. 1204–1215.
[39] W. J. Buchanan, S. Helme, and A. Woodward, “Analysis of the adoption
[13] A. Chan, A. Kharkar, R. Z. Moghaddam, Y. Mohylevskyy, A. Helyar,
of security headers in HTTP,” IET Information Security, vol. 12, no. 2,
E. Kamal, M. Elkamhawy, and N. Sundaresan, “Transformer-based
pp. 118–126, 2018.
Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or
Fine-tuning?” arXiv preprint arXiv:2306.01754, 2023. [40] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of Tricks for
[14] G. Baye, F. Hussain, A. Oracevic, R. Hussain, and S. A. Kazmi, “API Efficient Text Classification,” arXiv preprint arXiv:1607.01759, 2016.
security in large enterprises: Leveraging machine learning for anomaly [41] B. Li and L. Han, “Distance weighted cosine similarity measure for text
detection,” in ISNCC. IEEE, 2021, pp. 1–6. classification,” in IDEAL, China, Oct., 2013, pp. 611–618.
[15] E. Harlicaj et al., “Anomaly detection of web-based attacks in microser- [42] E. Techapanurak, M. Suganuma, and T. Okatani, “Hyperparameter-free
vices,” 2021. out-of-distribution detection using cosine similarity,” in ACCV, 2020.
[16] F. Shen, Y. Mu, Y. Yang, W. Liu, L. Liu, J. Song, and H. T. Shen, [43] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, “Generalized odin: Detecting
“Classification by retrieval: Binarizing data and classifiers,” in ACM out-of-distribution image without learning from out-of-distribution data,”
SIGIR, 2017, pp. 595–604. in CVF, 2020, pp. 10 951–10 960.
[17] W. Shi, J. Michael, S. Gururangan, and L. Zettlemoyer, “Nearest [44] P. Xia, L. Zhang, and F. Li, “Learning similarity with cosine similarity
neighbor zero-shot inference,” arXiv preprint arXiv:2205.13792, 2022. ensemble,” Information Sciences, vol. 307, pp. 39–52, 2015.
[18] A. M. Qamar, E. Gaussier, J.-P. Chevallet, and J. H. Lim, “Similarity [45] B. Naidan, L. Boytsov, Y. Malkov, and D. Novak, “Non-Metric Space
learning for nearest neighbor classification,” in ICDM. IEEE, 2008, Library (NMSLIB): An efficient similarity search library and a toolkit
pp. 983–988. for evaluation of k-NN methods for generic non-metric spaces,” https:
[19] J. J. Valero-Mas, A. J. Gallego, P. Alonso-Jiménez, and X. Serra, “Mul- //github.com/nmslib/nmslib, 2014, online; accessed 14-July-2014.
tilabel prototype generation for data reduction in k-nearest neighbour [46] J. E. Stone, K. S. Griffin, J. Amstutz, D. E. DeMarle, W. R. Sherman,
classification,” Pattern Recognition, vol. 135, p. 109190, 2023. and J. Günther, “ANARI: A 3-D Rendering API Standard,” Computing
[20] A. S. Reddy and B. Rudra, “Evaluation of Recurrent Neural Networks in Science & Engineering, vol. 24, no. 2, pp. 7–18, 2022.
for Detecting Injections in API Requests,” in CCWC, 2021, pp. 0936– [47] C. Torrano-Gimenez, H. T. Nguyen, G. Alvarez, S. Petrović, and
0941. K. Franke, “Applying feature selection to payload-based web application
[21] J. Ombagi, “Time-Based Blind SQL Injection via HTTP Headers: firewalls,” in IWSCN, 2011, pp. 75–81.
Fuzzing and Exploitation.” 2017. [48] B. Ware, “Analyzing Web Traffic ECML/PKDD 2007 Discovery Chal-
[22] I. Jemal, M. A. Haddar, O. Cheikhrouhou, and A. Mahfoudhi, “Perfor- lenge,” 2007, https://www.lirmm.fr/pkdd2007-challenge/comites.html.
mance evaluation of Convolutional Neural Network for web security,” [49] UNB, “A collaborative project between the Communications Security
Computer Communications, vol. 175, pp. 58–67, 2021. Establishment (CSE) & the Canadian Institute for Cybersecurity (CIC),”
[23] M. Gniewkowski, H. Maciejewski, T. R. Surmacz, and W. Walen- 2018, https://www.unb.ca/cic/datasets/ids-2018.html.
tynowicz, “HTTP2vec: Embedding of HTTP Requests for Detection of [50] B. Damele and M. Stampar, “SQLMAP: Automatic SQL injection and
Anomalous Traffic,” ArXiv, vol. abs/2108.01763, 2021. database takeover tool (2015),” http://sqlmap.org.
[24] I. Jemal, M. A. Haddar, O. Cheikhrouhou, and A. Mahfoudhi, “M-CNN: [51] Icesurfer and Nico, “SQLNINJA: SQL Server injection & takeover tool
a new hybrid deep learning model for web security,” in AICCSA, 2020, (2007),” https://sqlninja.sourceforge.net.
pp. 1–7. [52] S. Bennetts, “OWASP Zed attack proxy,” AppSec USA, 2013.
13
[53] Y. Fang, J. Peng, L. Liu, and C. Huang, “WOVSQLI: Detection of SQL Udi Aharon He is currently pursuing a Ph.D. de-
injection behaviors using word vector and LSTM,” in CSP, 2018, pp. gree within the Department of Computer Science at
170–174. Ariel University, Israel. His research activities span
[54] C. T. Giménez, A. P. Villegas, and G. Á. Marañón, “HTTP data set the fields of Machine Learning and Cybersecurity.
CSIC 2010,” CSIC, vol. 64, 2010. Specifically, the primary focus of his work is on
[55] S. Lavian, R. Dubin, and A. Dvir, “The API Traffic Research enhancing API security through the application of
Dataset Framework (ATRDF),” 2023, https://github.com/ArielCyber/ text-based models.
Cisco Ariel Uni API security challenge.
[56] P. M. S. Sánchez, J. M. J. Valero, A. H. Celdrán, G. Bovet, M. G.
Pérez, and G. M. Pérez, “A survey on device behavior fingerprinting:
Data sources, techniques, application scenarios, and datasets,” IEEE
Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1048–1077, Ran Dubin Received his B.Sc., M.Sc., and Ph.D.
2021. degrees from Ben-Gurion University, Beer Sheva,
[57] B. R. Dawadi, B. Adhikari, and D. K. Srivastava, “Deep Learning Israel, all in communication systems engineering.
Technique-Enabled Web Application Firewall for the Detection of Web He is currently a faculty member at the Computer
Attacks,” Sensors, vol. 23, no. 4, p. 2073, 2023. Science Department, Ariel University, Israel. His
[58] H. Mac, D. Truong, L. Nguyen, H. Nguyen, H. A. Tran, and D. Tran, research interests revolve around zero-trust cyber
“Detecting attacks on web applications using autoencoder,” in ICT, 2018, protection, malware disarms and reconstruction, en-
pp. 416–421. crypted network traffic detection, Deep Packet In-
[59] J. Wang, Z. Zhou, and J. Chen, “Evaluating CNN and LSTM for web spection (DPI), bypassing AI, Natural Language
attack detection,” in ICMLC, 2018, pp. 283–287. Processing, and AI trust.
[60] M. Ito and H. Iyatomi, “Web application firewall using character-level
convolutional neural network,” in CSPA, 2018, pp. 103–106.
[61] I. Jemal, M. A. Haddar, O. Cheikhrouhou, and A. Mahfoudhi, “Perfor- Amit Dvir Amit Dvir received his B.Sc., M.Sc.,
mance evaluation of Convolutional Neural Network for web security,” and Ph.D. degrees from Ben-Gurion University, Beer
Computer Communications, vol. 175, pp. 58–67, 2021. Sheva, Israel, all in communication systems engi-
[62] L. Yan and J. Xiong, “Web-APT-Detect: a framework for web-based neering. He is currently a Faculty Member in the
advanced persistent threat detection using self-translation machine with Department of Computer Science and the head of
attention,” IEEE Letters of the Computer Society, vol. 3, no. 2, pp. 66– the Ariel Cyber Innovation Center, Ariel University,
69, 2020. Israel. From 2011 to 2012, he was a Postdoctoral
[63] V. L. Pochat, T. Van Goethem, S. Tajalizadehkhoob, M. Korczyński, Fellow at the Laboratory of Cryptography and Sys-
and W. Joosen, “Tranco: A research-oriented top sites ranking hardened tem Security, Budapest, Hungary. His research inter-
against manipulation,” arXiv preprint arXiv:1806.01156, 2018. ests include enrichment data from encrypted traffic.
[64] Y. Zhao, Z. Nasrullah, and Z. Li, “PyOD: A python toolbox for scalable
outlier detection,” arXiv preprint arXiv:1901.01588, 2019.
Chen Hajaj Chen Hajaj holds Ph.D. (Computer
[65] M. Aumüller, E. Bernhardsson, and A. Faithfull, “ANN-Benchmarks:
Science), M.Sc. (Electrical Engineering), and B.Sc.
A benchmarking tool for approximate nearest neighbor algorithms,”
(Computer Engineering) degrees, all from Bar-Ilan
Information Systems, vol. 87, p. 101374, 2020.
University. He is a faculty member in the Depart-
[66] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
ment of Industrial Engineering and Management, the
of deep bidirectional transformers for language understanding,” arXiv
head of the Data Science and Artificial Intelligence
preprint arXiv:1810.04805, 2018.
Research Center, and a member of the Ariel Cyber
[67] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
Innovation Center. From 2016 to 2018, Chen was a
L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized
postdoctoral fellow at Vanderbilt University. Chen’s
BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
research activities are in the areas of Machine Learn-
[68] H. Guo, S. Yuan, and X. Wu, “Logbert: Log anomaly detection via bert,”
ing, Game Theory, and Cybersecurity. Specifically,
in 2021 International Joint Conference on Neural Networks (IJCNN).
the focuses of his work are on encrypted traffic classification, how to detect
IEEE, 2021, pp. 1–8.
and robustify the weak spots of AI methods (adversarial artificial intelligence),
[69] A. Arning, R. Agrawal, and P. Raghavan, “A Linear Method for
and multimodal classification techniques.
Deviation Detection in Large Databases.” in KDD, vol. 1141, no. 50,
1996, pp. 972–981.
[70] T. Yu, H. Fei, and P. Li, “U-BERT for Fast and Scalable Text-
Image Retrieval,” in Proceedings of the 2022 ACM SIGIR International
Conference on Theory of Information Retrieval, 2022, pp. 193–203.
[71] J. Xin, R. Tang, Y. Yu, and J. Lin, “BERxiT: Early exiting for BERT
with better fine-tuning and extension to regression,” in Proceedings of
the 16th Conference of the European Chapter of the Association for
Computational Linguistics: Main Volume, 2021, pp. 91–104.
[72] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers,
and I. Gurevych, “AdapterDrop: On the efficiency of adapters in trans-
formers,” arXiv preprint arXiv:2010.11918, 2020.
[73] I. Tarunesh, S. Aditya, and M. Choudhury, “Trusting RoBERTa over
BERT: Insights from checklisting the natural language inference task,”
arXiv preprint arXiv:2107.07229, 2021.
[74] P. Rajapaksha, R. Farahbakhsh, and N. Crespi, “BERT, XLNet or
RoBERTa: the best transfer learning model to detect clickbaits,” IEEE
Access, vol. 9, pp. 154 704–154 716, 2021.
[75] Y. Jia, “Design of nearest neighbor search for dynamic interaction
points,” in 2021 2nd International Conference on Big Data and In-
formatization Education (ICBDIE). IEEE, 2021, pp. 389–393.
[76] F. Cheng, R. J. Hyndman, and A. Panagiotelis, “Manifold learning with
approximate nearest neighbors,” ArXiv, 2021.
[77] AWS, “Build k-Nearest Neighbor (k-NN) similarity
search engine with Amazon Elasticsearch Service,”
2020, https://aws.amazon.com/about-aws/whats-new/2020/03/
build-k-nearest-neighbor-similarity-search-engine-with-amazon-elasticsearch-service.