Yin et al. Cybersecurity (2024) 7:30
https://doi.org/10.1186/s42400-024-00224-w

RESEARCH Open Access

Discovering API usage specifications for security detection using two-stage code mining
Zhongxu Yin1*, Yiran Song2† and Guoxiao Zong1*†

Abstract
An application programming interface (API) usage specification, which includes the conditions, calling sequences, and semantic relationships of the API, is important for verifying its correct usage, which is in turn critical for ensuring the security and availability of the target program. However, existing techniques either mine the co-occurring relationships of multiple APIs without considering their semantic relationships, or use data flow and control flow information to extract semantic beliefs on API pairs, which is difficult to extend to mining specifications for multiple APIs. Hence, we propose an API specification mining approach that efficiently extracts a relatively complete list of the API combinations and semantic relationships between APIs. This approach analyzes a target program in two stages. The first stage uses frequent API set mining based on frequent common API identification and filtration to extract the maximal set of frequent context-sensitive API sequences. In the second stage, the API relationship graph is constructed using three semantic relationships extracted from the symbolic path information, and the specifications containing semantic relationships for multiple APIs are mined. The experimental results on six popular open-source code bases of different scales show that the proposed two-stage approach not only yields better results than existing typical approaches, but can also effectively discover the specifications along with the semantic relationships for multiple APIs. Instance analysis shows that the analysis of security-related API call violations can assist in the cause analysis and patching of software vulnerabilities.
Keywords Specification mining, Frequent API sequence, Semantic relationship, Under-constrained symbolic execution, Vulnerability mining

Yiran Song and Guoxiao Zong have contributed equally to this work.
*Correspondence: Zhongxu Yin, [email protected]; Guoxiao Zong, [email protected]
1 Information Engineering University, Zhengzhou 450001, China
2 Henan University of Animal Husbandry Economy, Zhengzhou 450046, China

© The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Introduction
In programming, calls to application programming interface (API) functions usually need to follow a particular specification. These specifications generally express the intrinsic characteristics and security requirements of the program. Program developers who fail to adhere to these specifications can easily introduce defects into the program, which may even cause serious security problems. In SSLINT (He et al. 2015), the authors manually built the API security usage specifications of the certificate validation process in the open-source projects OpenSSL and GnuTLS, and found about a dozen missing-verification errors in the source code that could lead to man-in-the-middle attacks. Unfortunately, although some APIs give formal usage specifications in the official documentation, many APIs do not have public usage specifications (Jana et al. 2016). Because the same set of APIs may be widely used, the vulnerability caused by violations of
the specification is highly likely to be reproduced in different applications. Therefore, the mining of API usage specifications in a program has become an important aspect of program security analysis.
Some methods automatically extract API usage specifications and use them for security analysis. To mine the association relationships of APIs, PR-Miner (Li and Zhou 2005) uses an effective frequent itemset mining technique. Frequent itemset mining is a branch of data mining that focuses on looking at sequences of actions or events, each of which has a number of features. The aim of a frequent itemset mining algorithm is to find all common sets of items, defined as those itemsets that have at least a minimum support (Tamaskar and Raut 2016). However, the rules extracted by PR-Miner are redundant and lack the parameters of, and the conditional dependencies between, APIs.
Other mining methods, such as frequent subgraph mining (Huan et al. 2004) and under-constrained symbolic execution (Ramos and Engler 2015), do include semantic beliefs. Instead of performing symbolic execution from the start position of the program, under-constrained symbolic execution assumes a certain precondition and starts analysis directly from the entry of the target function of interest. Even so, the computing cost is relatively large and these methods cannot be scaled to large programs: large and complex programs still contain plenty of complex functions, so the overhead remains unaffordable.
In this paper, our aim is to use an API specification mining approach that efficiently extracts a relatively complete list of the API combinations and semantic relationships between APIs. The extracted API specifications and the security-sensitive function model (Yin et al. 2020) constitute a static method for detecting call sequences that violate the API specification and for bug discovery and repair.
The existing methods of extracting the relationship between APIs are usually limited to the relationship between pairwise APIs. Our method extracts the relationships in all related API sequences and then recovers the API call specifications without documents.
To achieve the goal above, we propose an approach that analyzes the target program in two phases. The first phase extracts context-sensitive frequent API sequences using frequent itemset mining. The second phase takes the sequences mined in the first phase and mines the specifications that contain the semantic relationships of multiple APIs using under-constrained symbolic execution.
The main contributions of this work are:

1. A new API specification mining approach that can automatically and efficiently extract API specifications containing semantic relationships for multiple APIs. The approach has two features.
(a) Mainly from source code. The approach can mine specifications mainly from software code without any prior knowledge about the software and without requiring any rule templates, annotation, or feedback from programmers.
(b) Domain adapted. Our method is able to deal with longer and relatively complete API sequences. Meanwhile, path pruning based on domain adapting can obtain better extraction efficiency and scalability.
The experimental results show that the approach can mine usage specifications that contain the semantic relationships among multiple APIs. Moreover, the accuracy and efficiency are clearly better than those of the existing typical approaches.
2. We proposed the concept of frequent common API in the problem domain of frequent API sequence mining. The structural characteristics of frequent common APIs in the FP tree are described and combined with an accurate identification and filtration method for frequent common APIs, which greatly improved the effectiveness of our proposed frequent API maximum sequence mining algorithm. An FP tree is the main structure for representing itemsets in the frequent closed-itemset mining algorithm FPclose (Grahne and Zhu 2003a). Some APIs are called frequently, but they are mainly common functions that implement error message prompts after an error. Such APIs are referred to as frequent common APIs in this paper. The algorithm identifies and filters out the frequent common API nodes in the FP tree that are close to the root node and have a large overall frequency and out-degree, which reduces their interference with the mining results.
3. We implemented the proposed approach and evaluated it. The results show that the proposed semantic relationship specification mining approach is effective and superior to relevant approaches. Further analysis of violations of the specifications reveals various security problems.

The remainder of this paper is organized as follows: Sect. "Related work" presents the related work. Sect. "Problem definition" analyzes the key issues of this paper for specification mining through a motivating example. Sect. "Semantic relationship sensitive API specification mining approach" describes the overall approach
of our method, the improved maximal frequent itemset mining algorithm, and the semantic relationship extraction and call specification mining of APIs. Sect. "Implementation and evaluation" describes the implementation and presents the evaluation results. Sect. "Analysis of API call specifications violations" presents typical cases. Sect. "Conclusion" concludes our work and discusses directions for future work.

Related work
API specification mining uses source code mining approaches to discover the conditions, calling sequences, and semantic relationships of API calls in a program (Dyer et al. 2013). After more than ten years of effort by researchers and the development of program analysis technology, research in this area has made progress. Typical examples include frequent itemset mining-based approaches, security-sensitive function-based mining approaches, template-based mining approaches, and document-based methods.

Security-sensitive function-based mining approach
In the security-sensitive function-based mining approach, security-sensitive functions (Chen et al. 2018) are discovered. The specifications are then revealed by mining the pre- and post-conditions of these functions (Nguyen et al. 2015, 2014; Ramanathan et al. 2007). Liang et al. proposed AntMiner (Bian et al. 2018a; Liang et al. 2016), which uses the idea of program slicing to preprocess the source code and reduce noise interference. It then finds security-sensitive functions through a heuristic approach and computes their preconditions to yield the specification. The Chucky approach (Yamaguchi et al. 2013) proposed by Yamaguchi et al. uses a manually specified sensitive function or variable as a starting point. It then slices statements in the calling functions of a sensitive function and clusters the conditional statements in the slice, outputting the analysis result as a usage specification for the sensitive function. The APEX approach (Kang et al. 2016) proposed by Kang et al. analyzes the post-conditions of each API function called by the program, finds the fallible APIs that are sensitive to error handling, and identifies error paths and non-error paths according to the number of branching points of the path to infer the error return values that make up the specification. The approach proposed by Chang (Chang et al. 2008, 2012) first chooses the set of APIs of interest, then for every API, uses each of its call site instances to construct dependence spheres from a system dependency graph of the target program. It then performs frequent isomorphic graph minor mining from the dependence spheres. The frequent isomorphic graph minors are selected as the usage specification. This approach can discover the calling order and semantic relationships of multiple APIs in a program.
The security-sensitive function-based mining approaches implement path-sensitive and flow-sensitive specification mining for specific functions but must first locate sensitive functions. Their effectiveness depends on the accuracy of security-sensitive function identification, and they cannot discover specifications in APIs that are not recognized as sensitive functions. Moreover, the relationships for more than two functions cannot be obtained. In our approach, we filter unrelated APIs and retain the other APIs instead of just choosing candidate APIs for specification mining. This enables us to have greater coverage and fewer false negatives.

Frequent itemset mining based approach
The frequent itemset mining based approach uses the program's statements or structural pattern sequences to extract frequently occurring subsequences as specifications. For example, PR-Miner uses the frequent closed-itemset mining algorithm FPclose (Grahne and Zhu 2003a) to mine the co-occurrences of statements in the function. This approach can mine the co-occurrence relationships among multiple APIs but does not reflect the semantic relationships between APIs (Li and Zhou 2005). In addition, the false positive rate in the mining results is high. In the evaluation of PR-Miner, 45 of the top 60 violations for the extracted rules in the violation report for PostgreSQL were false positives caused by false programming rules. Furthermore, when dealing with sequences of multiple API calls, the API sequence is split into multiple pairwise sequences; for example, [A, B, C] is split into [A, B], [A, C], and [B, C], which leads to redundancy in subsequent rule extraction. Sequences without a semantic relationship become interference items in specification mining.
Henkel et al. proposed a specification mining approach based on unsupervised learning (Henkel et al. 1904). The approach assumes that the APIs in a specification have similar function names. It then clusters the APIs by function name. Frequent itemset mining is conducted on the projection of elements onto the domain of clusters.
PR-Miner uses the FPclose algorithm to get all closed sub-itemsets, labeled with different thresholds as the confidence for the corresponding rule. The algorithm overcomes the problem of traditional frequent pattern mining, which generates excessive pattern results depending on the threshold settings.
Several works are closely related to maximal pattern mining. Unil Yun et al. proposed more efficient maximal weighted frequent pattern mining considering weight information as well as the support values based
on tree and array structures (Lee and Yun 2018; Yun and Lee 2016; Yun et al. 2016); they also use approximate weighted maximal frequent patterns considering error tolerance (Lee et al. 2016).

Template-based specification
The template-based specification mining approach uses a defined specification template to filter out the sequences that match its patterns and adopts a statistical approach to select a high-support pattern as a specification (Bian et al. 2018b). For example, Lemieux regards a program as a linear execution sequence of statements (Lemieux et al. 2015), extracts the propositions that satisfy the user-specified temporal logic template, and uses a statistical approach to mine the property with the highest confidence as the true proposition. Then, this method uses the true proposition to construct the temporal logic as a specification. Yun et al. (Yun et al. 2016) record the API node sequence and symbolic execution path by performing a lightweight static symbolic execution. Then, the frequency of context patterns, including the return value of a single API, the constrained causal relationships of the API pairs, and the implicit pre- and post-condition relationships of API pairs in the recorded result, is used to find the return processing of a single API and the control dependency relationship specification between pairs of APIs. Such approaches can mine specifications targeted by a particular specification template. However, the types of specifications that are mined are limited to the types defined in the template, and the semantic relationships extracted are only between specific API pairs. A framework describing several typical templates was proposed for classifying API usage constraints and misuses (Schlichtig et al. 2022).
Template-based approaches are highly targeted and have a low false positive rate, but they cannot effectively mine relationships for specifications containing multiple APIs. If they are directly extended to mine multiple API specifications, the complexity of the algorithm increases rapidly with the number of possible API combinations and the number of potential semantic relationships, which can lead to scalability problems. For example, in (Chang et al. 2008, 2012), the frequent isomorphic graph minor of the program dependency graph centered on the selected candidate API is used as a specification. The specification mining problem is then modeled as a frequent graph minor mining problem, which is an NP-complete problem. In complex protocol handling programs, there are often call specifications between multiple APIs. Therefore, how to efficiently and accurately mine call specifications among multiple APIs is critical to the security analysis of these programs. In our approach, we use a two-stage process and focus on API sequence mining and semantic relationship extraction in different stages to improve scalability.

Document-based method
There are also some methods that extract specifications from official or community documentation of open-source software for matching. Lv et al. (Lv et al. 2020) proposed a method using NLP to extract integration assumptions from the library documents and then verify their consistency with the APIs used in a program. Wang et al. (Wang and Zhao 2023) combined source code and documents to cross-validate and extract patterns.
Document-based methods extract relationships within a code fragment (function scope, file scope, etc.). Most of them extract only the return values of pairwise APIs, which limits the completeness of the specification. In addition, inaccurate descriptions in the documents may introduce deviations.

Problem definition
In this section, we further explain the problems we need to solve through motivating examples. The functions for tasks such as access control and protocol processing in programs are mainly implemented in the code base through API calls. The APIs are the main carrier for the interface and encapsulate the internal states of the program. Missing or out-of-order API calls and missing checks on API calls can lead to security breaches and performance degradation. Specification mining extracts a set of associated APIs from multiple instances of API calls to verify their correct usage. In these instances, the necessary condition checks and calling context restrictions related to the associated API are an important part of the specification. The following typical examples of code introduce the goal of our approach.
Figure 1a shows the code snippet that implements time-stamp verification in the well-known cryptographic library OpenSSL, which calls certificate verification APIs to verify the certificate contained in the time-stamp signature. The following five APIs are called in the code snippet to implement the certificate verification:

1. X509_STORE_CTX_init, which initializes the certificate verification environment.
2. X509_STORE_CTX_set_purpose, which sets the certificate verification purpose.
3. X509_verify_cert, which verifies the certificate.
4. X509_STORE_CTX_get1_chain, which obtains the certificate chain information after the verification is successful.
5. X509_STORE_CTX_free, which cleans up the certificate verification environment.
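As a rough illustration of how a set of associated APIs can be recovered from multiple call instances, the following Python sketch counts how many instances contain each API set and keeps the maximal sets that reach a minimum support. It is a brute-force stand-in for illustration only, not the FPclose-based mining used in this paper. The call instances, the `printf` noise call, and the helper names `frequent_api_sets` and `maximal_sets` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_api_sets(call_instances, min_support):
    """Map each API set to its support, keeping sets seen in at least
    min_support call instances (brute force; only suitable for tiny inputs)."""
    counts = Counter()
    for apis in call_instances:
        unique = sorted(set(apis))
        # Enumerate every non-empty subset of the APIs in this instance.
        for size in range(1, len(unique) + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def maximal_sets(frequent):
    """Keep only the frequent sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical call instances: each list holds the APIs called in one function.
instances = [
    ["X509_STORE_CTX_init", "X509_STORE_CTX_set_purpose", "X509_verify_cert",
     "X509_STORE_CTX_get1_chain", "X509_STORE_CTX_free"],
    ["X509_STORE_CTX_init", "X509_STORE_CTX_set_purpose", "X509_verify_cert",
     "X509_STORE_CTX_free"],
    ["X509_STORE_CTX_init", "X509_verify_cert", "X509_STORE_CTX_free", "printf"],
]

frequent = frequent_api_sets(instances, min_support=3)
maximal = maximal_sets(frequent)
```

With a support threshold equal to the number of instances, only the APIs shared by every instance survive, so the noise call drops out. A real miner would avoid enumerating all subsets, which grows exponentially with the number of APIs per instance.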
We found that violating any of the following three requirements causes security problems.

1. The API calls cannot be missing

Figure 1b shows the relationships of the APIs mentioned above. In the sequence of API calls shown in Fig. 1a, the relevant APIs cannot be missing. The initialization operation performed by X509_STORE_CTX_init on line 258 is a prerequisite for all subsequent API calls. If the code does not call X509_STORE_CTX_set_purpose on line 260, the purpose of the verification will be ambiguous. If X509_STORE_CTX_free is not called on line 273, it will cause a memory leak in the certificate verification environment.

Fig. 1 OpenSSL code snippet implementing the time-stamp protocol
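The "cannot be missing" requirement can be sketched as a simple membership check of an observed call sequence against the set of required APIs. This is a minimal Python sketch under stated assumptions: the tuple `REQUIRED_APIS`, the function `missing_calls`, and the observed call list are hypothetical, and X509_STORE_CTX_get1_chain is left out of the required set on the assumption that it is only needed after a successful verification.

```python
# Assumed specification: APIs that every certificate-verification call site
# is expected to contain (hypothetical encoding of the requirement above).
REQUIRED_APIS = (
    "X509_STORE_CTX_init",
    "X509_STORE_CTX_set_purpose",
    "X509_verify_cert",
    "X509_STORE_CTX_free",
)


def missing_calls(observed_calls, required=REQUIRED_APIS):
    """Return the required APIs that never appear in observed_calls,
    listed in the order given by the specification."""
    seen = set(observed_calls)
    return [api for api in required if api not in seen]


# Hypothetical call site that neither sets the purpose nor frees the context.
observed = ["X509_STORE_CTX_init", "X509_verify_cert", "X509_STORE_CTX_get1_chain"]
report = missing_calls(observed)
```

A check like this only covers presence. It says nothing about call order, control dependencies, or whether the calls operate on the same context object.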


2. The order of invocation and control dependencies among these APIs cannot be violated

A later API is called on the premise that an earlier API has already been called. In Fig. 1b, the edges labeled C indicate the control dependencies between the relevant APIs. For example, X509_STORE_CTX_init and X509_STORE_CTX_set_purpose must be called before X509_verify_cert. At the end of the certificate verification, X509_STORE_CTX_free must be called to clear the environment. In addition, as shown on lines 258 and 273, the initialization API X509_STORE_CTX_init and the end API X509_STORE_CTX_free appear as a pair, but an X509_STORE_CTX_init call is not necessarily followed by a call to X509_STORE_CTX_free. When the X509_STORE_CTX_init call fails, the X509_STORE_CTX_free function cannot be called; otherwise, a NULL pointer dereference problem will occur. X509_STORE_CTX_get1_chain must be called if X509_verify_cert returns a certificate verification success.

3. The semantic relationships (data dependence and parameter sharing) cannot be missing

Again, the data dependence and parameter sharing relationships are also an important part of the specification. In Fig. 1b, the edges marked with D indicate the data dependence relationships between the related APIs, and the edges marked with S indicate the shared parameter relationships among the related APIs.
For instance, X509_STORE_CTX_init creates the cert_ctx certificate validation environment and the rest of the APIs share the cert_ctx parameter. Data dependence and parameter sharing can distinguish whether the call sites of a series of APIs are semantically associated. Therefore, it is necessary to analyze them. In this paper, we refer to the above three relationships of APIs as semantic relationships.
Clearly, the usage specifications of these APIs should include not only whether they should appear together, but also the control and data flow relationships among them, expressed as a condition statement, related parameters, and the return values. Fig. 2 shows the client-side code of the SSL protocol in mbed TLS, which is ARM's open-source encryption library. If the code snippet does

Fig. 2 SSL API calls in main function of dtls_client.c in the mbed TLS library
not call the mbedtls_ssl_get_verify_result API in line 229 to verify the certificate or if mbedtls_ssl_read and mbedtls_ssl_write are called without evaluating their return values, the related device may construct a fake certificate to disguise itself as an SSL server, bypassing the verification process of the client program in the embedded system and enabling a man-in-the-middle attack. Further, if mbedtls_ssl_free is not called after mbedtls_ssl_read and mbedtls_ssl_write, the resource is not released. Finally, if the mbedtls_ssl_close_notify API call on line 308 does not conform to the specification, the connection will be unstable, and the performance of the communication will be heavily degraded.
In addition, inter-procedure analysis should be conducted in the specification mining process. The call sites of the APIs in the sequence could be distributed over various functions in a function call chain, such as in the OpenSSL time-stamp response code in Fig. 3. Here, TS_RESP_new and TS_RESP_free should be called as a pair, but in TS_RESP_create_response, there is only a direct call to TS_RESP_new. The call to TS_RESP_free is executed indirectly through TS_RESP_CTX_free, and intra-procedure specification mining approaches cannot discover this association.
Omitting data flow and parameter sharing analysis introduces important risks. As shown in Fig. 4, the BN_generate_dsa_nonce function in OpenSSL-1.1.1 is called by the ecdsa_sign_setup function to generate a random number k in the ECDSA signature to protect against a weak random number generator. The k value requires strict protection to prevent leakage and should be reset to zero after use. A calling specification is shown in Fig. 13. As shown there, the call sequence conforms to the specification from the perspective of call sequence and completeness. However, from the perspective of parameter sharing, the function OPENSSL_cleanse is not called to clean SHA512_CTX, but only to clean the variable private_bytes. The risk is that after k is generated and hashed with SHA512, the k value can be restored from the data left in the memory of SHA512_CTX in a particular situation.
The pageant_handle_msg function in Pageant.c of PuTTY (shown in Fig. 16) is similar. The hash execution environment data structure is not cleaned after MD5Final is called, which will lead to potential memory leaks.
In summary, the goal of specification mining is to mine the associated API sequences that must co-occur in the program and analyze the control dependencies and data flow relationships among the APIs in the sequence.
As described in Sect. "Related work", current frequent itemset mining based approaches mine the co-occurring relationships of multiple APIs without considering their semantic relationships. They lack flow-sensitive, path-sensitive, and context-sensitive analysis and do not obtain information about the data and control flows of the APIs. In addition, these approaches

Fig. 3 TS_RESP_create_response function in ts_rsp_sign.c of OpenSSL-1.1.1 and its time-stamp API calls
openssl-1.1.1a\crypto\bn\bn_rand.c

int BN_generate_dsa_nonce(BIGNUM *out, const BIGNUM *range,
                          const BIGNUM *priv, const unsigned char *message,
                          size_t message_len, BN_CTX *ctx)
{
    SHA512_CTX sha;
    /* ... */
    k_bytes = OPENSSL_malloc(num_k_bytes);
    if (k_bytes == NULL)
        goto err;
    /* ... */
    for (done = 0; done < num_k_bytes;) {
        if (RAND_priv_bytes(random_bytes, sizeof(random_bytes)) != 1)
            goto err;
        SHA512_Init(&sha);
        SHA512_Update(&sha, &done, sizeof(done));
        SHA512_Update(&sha, private_bytes, sizeof(private_bytes));
        SHA512_Update(&sha, message, message_len);
        SHA512_Update(&sha, random_bytes, sizeof(random_bytes));
        SHA512_Final(digest, &sha);
    }
    /* ... */
 err:
    OPENSSL_cleanse(private_bytes, sizeof(private_bytes));
    return ret;
}
Fig. 4 The BN_generate_dsa_nonce function in OpenSSL-1.1.1, which calls hash-related APIs

lack inter-procedure analysis. The security-sensitive function-based approaches are limited to the semantic relationship analysis of the functions that are specified as sensitive. Moreover, in template-based and security-sensitive function-based approaches, the target program is represented in the form of symbolic paths or program dependency graphs. The data flow and control flow information are included in the mined specification. However, these approaches are difficult to extend to tasks that mine specifications for multiple APIs.
To address this problem, we propose a two-stage API call specification mining approach that efficiently extracts a relatively complete list of the API combinations and semantic relationships between APIs. It focuses on API sequence mining and semantic relationship extraction in different stages to mine specifications for multiple APIs in a scalable way.

Semantic relationship sensitive API specification mining approach
Current popular open-source projects usually have strong modularity with good API design, and in the main project code, the relevant APIs are correctly called in a certain number of instances. Under these assumptions, this study focuses on specification mining for open-source projects. Our approach efficiently extracts a relatively complete list of the API combinations for the code of the target program under analysis, which reduces unnecessary analysis when dealing with multiple APIs in a co-occurrence relationship. Based on this, the semantic relationships between APIs are extracted for an accurate analysis at lower cost than traditional symbolic execution. The main process of this approach, shown in Fig. 5, consists of two stages. The input of
[Figure 5: flow diagram. First stage: source code, preprocessing, interprocedural API sequence extraction, frequent common API filtering, frequent sequence mining, frequent API sequence. Second stage: under-constrained symbolic execution, API pair semantic relationship extraction, multi-API relationship extraction based on relation graph, API semantic relationship type, API calling specification. Counter examples are reported for further analysis and are either discarded as interference APIs or kept as violations of API call specifications.]

Fig. 5 Overview of the proposed specification mining approach

our approach is the source code of the projects, and the API calling specification is mined as output. In the process, counterexamples are reported and analyzed to distinguish interference APIs from violations of API call specifications.
Stage 1: Frequent API set mining based on frequent common API identification and filtration.
In this stage, frequent API set mining is performed on the given open-source project to find frequent API sequences. First, the number of API calls related to the specifications may be small in some projects, which makes it difficult to extract valid API sequences through frequent itemset mining approaches. Hence, we also use client code that calls these libraries for specification mining. Then, a global interprocedural analysis is performed on the target source code, and a call graph is constructed to extract inter-procedure API sequences by traversing each node (which corresponds to a function) in the call graph. Next, for the extracted sequence set, the maximal frequent itemset mining approach (Grahne and Zhu 2003a, 2003b) is used to obtain frequently appearing API sequences. Some APIs in the program have a higher overall frequency and are distributed among multiple frequent sequences, but they are only weakly correlated with the APIs that implement the main functionality. Typical examples include APIs related to error handling and logging. This increases the number of frequent APIs that are not semantically related, which lowers the quality of the specifications and consumes a large amount of execution time. Hence, such APIs are discarded before frequent API mining is performed. This improves both the quality of the specifications and the efficiency of the algorithm.
Stage 2: API semantic relationship specification mining based on domain adapted under-constrained symbolic execution and graph-based relationship aggregation.
This stage performs a domain adapted under-constrained symbolic execution to extract the API semantic relationship specification from the frequent API sequences of the first stage. First, a path- and flow-sensitive analysis of the code of the target program under analysis is performed for the under-constrained symbolic execution, and the frequent API sequences are used as trigger points to record the symbolic information and path constraint information of the APIs in these sequences. Then, the control dependencies among the APIs are extracted from the path constraint information, and the pairwise relationships of shared parameters and return value data dependencies for the APIs are extracted from the symbolic information of the API call sites. Using the API sequences and their pairwise semantic relationships, the API relationship graph is constructed, and the relationships are aggregated. The context-sensitive, path-sensitive, and flow-sensitive specifications for multiple APIs are mined from the graph. The collection of symbolic path and constraint information at this stage is limited to the relatively small set of APIs in the frequent sequences. Moreover, only the relationships of the APIs in this limited set are mined, which significantly reduces the scope and the number of search states for the relationships.

Frequent API set mining based on frequent common API identification and filtration
Definition of frequent common API
We define the frequent common APIs illustrated below as interference APIs, which interfere with the mining of frequent

APIs. In the remainder of this section, we explain the motivation.
In the first stage of analysis, the API sequences are first extracted from the code of the target program under analysis, and the frequent itemset mining algorithm is used to mine API sequences with two or more combined APIs to find the API co-occurrence relationships. A frequent itemset is a set of items that frequently appear together in a data set and whose degree of support is greater than or equal to the minimum support degree (min_sup), where the support degree refers to the frequency at which a certain set appears in all sequences. For specification mining, the existing frequent itemset-based approach has the following three problems.

1. For some target open-source projects, specification-related APIs may occur infrequently, resulting in insufficient support. This makes it difficult to extract valid API sequences.
2. Without context-sensitive analysis, it is difficult to find inter-procedure API sequences and related specifications.
3. There is no filtration of frequently occurring common functions. If these APIs are not filtered, they are likely to form frequent co-occurring relationships with other APIs, resulting in the output of many redundant or unrelated API sequences, increasing the number of invalid specifications and using up the algorithm’s computing resources. Therefore, frequent common APIs should be filtered to reduce interference.

To address the first problem, the following two strategies are adopted to increase the frequency of the specification-related APIs: 1) related code is added by introducing other project files that reference the code base and 2) functional verification and performance testing code that large open-source projects usually provide (usually in the "test" folder) is added. These additional lines of code usually do not consider security, which is reflected in their non-compliance with the specified semantic relationships among APIs. In the second phase of the analysis, we exclude these supplementary lines of code.
For the second problem, a global inter-procedural analysis is performed to extract the API call sequences across functions by means of a function call graph constructed by the compiler framework. First, the source code is analyzed by the compiler front end, and the abstract syntax tree and control flow graph of each function are constructed through lexical and syntax analysis. Then, using the call relationships between functions, a global function call graph is defined. In the function call graph, a node represents a function, and an edge represents a calling relationship between the functions. Then, using the source code compilation engine, each node in the function call graph is traversed and the inter-procedure API call sequence is extracted.
For the third problem, we combine the filtration of interference APIs with the classic FPMAX algorithm to mine maximal frequent API sequences to improve the quality and efficiency of specification mining (Grahne and Zhu 2003b).
The PR-Miner approach uses the FPclose algorithm to mine only the closed sub-itemsets. A closed sub-itemset is a sub-itemset whose support is different from that of all of its super-itemsets (Chang and Podgurski 2012). For example, in an itemset database
$D = \{\{a, b, c, d, e\}, \{a, b, d, e, f\}, \{a, b, d, g\}, \{a, c, g, h\}\}$
the frequent sub-itemsets $\{b\}$, $\{d\}$, $\{a, b\}$, $\{a, d\}$ and $\{b, d\}$ are not closed, since their supports are the same as that of their super-itemset $\{a, b, d\}$. With a minimum support threshold of 3, FPclose generates $\{a\}$:4 and $\{a, b, d\}$:3 as closed sub-itemsets, while FPMAX generates only $\{a, b, d\}$:3 as the maximal frequent sub-itemset. In our approach, we only mine the frequent API sequences whose support exceeds the specified threshold. One of our goals is to get a relatively complete API calling sequence including all the subsequences without considering the different confidence of the subsequences. So, we use FPMAX instead of the FPclose and FPGrowth algorithms.
The classic FPMAX algorithm requires each element to appear only once in each call sequence, so we delete duplicate APIs in the sequences. Then, the total frequency of each API in the sequences is counted, and those APIs whose frequencies are lower than the minimum support are filtered out of each sequence. Then, the APIs in each sequence are sorted in descending order according to their frequency. The APIs in each sequence and their frequencies are inserted into the FP tree one by one to construct it. The insertions begin at the root node. If the API to be inserted does not exist in the tree, a new branch is created. Otherwise, the frequency of the API is added to the frequency of the corresponding node. Fig. 6 shows a portion of an FP tree constructed from the API sequences extracted from the libTIFF-4.0.10 code. Here, φ is the root node of the FP tree, and the remaining nodes show the names of the corresponding APIs and their frequencies on the path. Each vertical path (solid line connection) in the FP tree is a data item set that satisfies the minimum support degree in the sequence. To quickly

Fig. 6 Part of an FP tree for the API sequences mined from libTIFF-4.0.10. Header table entries: TIFFmalloc, TIFFfree, TIFFError, TIFFErrorExt, TIFFStripSize, _TIFFFillStriles. Node labels (API:frequency) visible in the figure: TIFFmalloc:19, TIFFfree:122, TIFFErrorExt:45, TIFFError:24, TIFFOpen:2, TIFFErrorExt:28, TIFFmalloc:91, TIFFError:5, TIFFRasterScanlineSize:12, TIFFWriteDirectoryTagData:12, TIFFScanlineSize:3, TIFFErrorExt:10, TIFFClose:2, TIFFStripSize:2, TIFFReadScanline:2, _TIFFFillStriles:2.
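The FP-tree construction behind Fig. 6 can be sketched in a few lines. The code below is a minimal sketch under stated assumptions, not the authors' tool: the `Node` class, the function names, and the six input sequences are hypothetical (only the libTIFF function names are real), ties in frequency are broken alphabetically, and a root child is flagged as a candidate interference API when both its frequency and its out degree strictly exceed the median over the root's children, which is one possible reading of the "midnum" and "midout" thresholds.

```python
from collections import Counter
from statistics import median


class Node:
    """One FP-tree node: an API name, its count on this path, and its children."""

    def __init__(self, api, parent=None):
        self.api = api
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(sequences, min_sup):
    # Each API may appear only once per sequence, so duplicates are dropped.
    deduped = [list(dict.fromkeys(seq)) for seq in sequences]
    freq = Counter(api for seq in deduped for api in seq)
    root, header = Node(None), {}          # header: API -> nodes carrying it
    for seq in deduped:
        # Drop APIs below the minimum support, then sort by descending frequency.
        items = sorted((a for a in seq if freq[a] >= min_sup),
                       key=lambda a: (-freq[a], a))
        node = root
        for api in items:
            child = node.children.get(api)
            if child is None:              # API not on this path yet: new branch
                child = Node(api, node)
                node.children[api] = child
                header.setdefault(api, []).append(child)
            child.count += 1
            node = child
    return root, header


def interference_candidates(root):
    """Root children whose frequency and out degree both exceed the median."""
    kids = list(root.children.values())
    if not kids:
        return []
    midnum = median(k.count for k in kids)
    midout = median(len(k.children) for k in kids)
    return sorted(k.api for k in kids
                  if k.count > midnum and len(k.children) > midout)


# Hypothetical API sequences, for illustration only.
SEQS = [
    ["TIFFOpen", "TIFFErrorExt", "TIFFClose"],
    ["_TIFFmalloc", "TIFFErrorExt", "_TIFFfree"],
    ["TIFFOpen", "TIFFClose"],
    ["_TIFFmalloc", "_TIFFfree"],
    ["TIFFErrorExt", "TIFFReadScanline"],
    ["TIFFErrorExt", "TIFFReadScanline"],
]
ROOT, HEADER = build_fp_tree(SEQS, min_sup=2)
CANDIDATES = interference_candidates(ROOT)
```

In this toy input the error-reporting call sorts first in every sequence that contains it, so it sits directly under the root with several unrelated branches below it, which is the pattern the three characteristics of interference APIs describe.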

access the same items in the tree, all nodes for the same item are connected in a linked list through the header table, which points to the linked list; these links are represented in Fig. 6 by horizontal dashed lines.
By analyzing the characteristics of the nodes of interference APIs (TIFFErrorExt and TIFFError) in the FP tree of Fig. 6, we find that a candidate interference API has the following three characteristics:

1. The overall frequency is over a certain value. According to its definition, an interference API must frequently appear in the code of the target program under analysis.
2. The node appears in the upper layers and is closer to the root node. Because the overall frequency of interference APIs tends to be higher, when the elements of the API sequence are sorted by frequency, they appear at the top. Hence, when the FP tree is constructed in this order, they are always in the upper layers of the tree.
3. Its out degree exceeds the average value. Interference APIs appear in multiple API sequences. When these sequences are inserted into the FP tree, the insertion traversal usually first passes through nodes representing frequent common APIs, and the nodes for APIs with a relatively low frequency are then inserted below them. Interference API nodes act as the parent nodes of multiple less frequent API nodes and usually have more branches.

Improved FPMAX algorithm
In the improved FPMAX algorithm, the maximal frequent API sequences are tracked by the global MFI tree (Maximal Frequent Itemset tree). The MFI tree also starts from root node φ. For each frequent sequence extracted through the FP tree, if the sequence is not a subsequence starting at any node in the MFI tree, it is inserted into the MFI tree. The insertion of sequences into the MFI tree is the same as the insertion of nodes into the FP tree. The algorithm starts with an empty sequence as the prefix of the FP tree and performs recursive processing as follows:

1. If there is only one path in the tree, set the support level of the path to the minimum frequency over its APIs. If the support level is greater than the minimum support level, insert the path into the MFI tree and return.
2. Identify and filter out interference APIs from the child nodes of the root node according to the frequency and out degree of the nodes. Put the interference APIs into the look-aside list and reconstruct the FP tree after filtration.
3. Extract each node pointed to by the header table in the FP tree as well as its frequency. Link it to the prefix sequence to form a new prefix. The support level of the new prefix is the minimum value of the frequency of each API.
4. If the support is less than the minimum support, discard the prefix.

Fig. 7 Pseudocode for the proposed FPMAX algorithm with frequent common API processing

5. If the support is greater than the minimum support, construct a conditional FP tree that removes the node, and call the algorithm recursively with the new prefix and the conditional FP tree as parameters.

Figure 7 shows the pseudocode of the improved FPMAX algorithm. Here, input T represents a constructed FP tree containing three fields: base, header, and root. "T.base" is the prefix of the current tree to be mined, "header" represents the header table of the FP tree, "LookAside" is a candidate list for interference APIs, and "MFIT" is a global MFI tree structure. The algorithm in Fig. 7 is based on the original FPMAX algorithm in Grahne and Zhu (2003b), supplemented by the processing of interference APIs in Lines 5–8. Lines 1–4 of the algorithm are the terminal path of the recursive process. When there is only one path in the tree and the support is greater than the minimum support, the path and its support are inserted into the MFI tree. Lines 5–8 filter the interference APIs and reconstruct the filtered FP tree. In line 6, "midnum" is the minimum support number of interference APIs and "midout" is the lowest out degree of interference APIs. The values of midnum and midout are calculated by finding the median of the support number and of the out degree of the child nodes of the root of the initial FP tree. Lines 9–11 build and recursively analyze the conditional FP tree (represented by $T_y$) for each node. In lines 12–13, when the conditional FP tree is not empty, the algorithm is called recursively with the conditional FP tree as input.
After this process, the candidate interference APIs in the LookAside list are distinguished. Frequent common APIs, such as {CRYPTO_malloc, CRYPTO_free} in OpenSSL and {_TIFFmalloc, _TIFFfree} in the libTIFF library, have plenty of co-occurrence relationships but little significance for subsequent semantic analysis. Other functions such as TIFFError and TIFFErrorExt in the libTIFF library, which occur with high frequency in the calling sequences, also increase the cost of analysis.

API semantic relationship specification mining based on domain adapted under-constrained symbolic execution and graph-based relationship aggregation
The frequent API sequences mined in Sect. "Semantic relationship sensitive API specification mining approach" are frequently occurring API combinations that may be semantically independent of each other. The second phase further extracts control and data dependencies on the parameters and return values of

the APIs in the frequent sequences of the code of the target program under analysis so that flow- and path-sensitive specifications may be mined. The current flow- and path-sensitive API call specification mining approaches have the following two problems:

1. The analysis, extraction, recording, and matching of all APIs in the code incur large computational and storage overheads, which limits scalability.
2. There is no semantic relationship matching among multiple APIs. When searching for dependence relationships between arbitrary APIs, as the number of APIs increases, the number of possible relationships between APIs increases rapidly. As a result, the search runtime increases sharply, reducing scalability.

In this section, we present our method for extracting the symbolic path constraint information for APIs in the frequent sequences output by the first stage and for constructing the API relationship graph according to the relationships between pairs of APIs to mine multiple API call specifications.

Domain adapted symbolic path constraint information extraction
For the first problem, we propose an under-constrained symbolic execution path information extraction approach. The scope of the symbolic execution path recording and constraint information analysis is limited to the APIs in the frequent sequences. As shown in Fig. 8, the variables involved are first symbolized and updated by traversing the statements of all the paths of the program from the entry point of the control flow graph of the code. Then, the statements in the program are parsed into expressions with symbolic variables and constants. If a control statement is encountered, a path-sensitive analysis is performed, and both paths of the branch targets are independently explored while appending the control condition to the related path constraint. When traversing any of the APIs in the frequent API sequence set, the path constraint is solved and the symbolic execution environment information, which includes the name of the API node, the symbolic execution path constraint, the corresponding parameters, and return value information, is recorded. Finally, the analysis results are recorded. By reducing the range of symbolic path information that is

Fig. 8 Symbol execution analysis based on frequent API sequence. Diagram labels: target program, control flow graph analysis, traversal and symbolic analysis of program, constraint analysis, global scheduling policy, under-constrained symbol execution, frequent API sequence, symbolized path information, path constraint information, symbolic execution environment information.

Table 1 Symbol types for the symbol execution trace

Type                     Name              Symbolic form
API                      function          $f_x,\ (\mathrm{arg}_x(1), \mathrm{arg}_x(2), \ldots, \mathrm{arg}_x(m)),\ (\mathrm{ret}_x(1), \mathrm{ret}_x(2), \ldots, \mathrm{ret}_x(n))$
Symbolic variable        var               $\mathrm{arg}_x(i),\ i \in \{1, \ldots, m\} \mid \mathrm{ret}_x(j),\ j \in \{1, \ldots, n\}$
Constant                 const             num | string
Comparison operator      cmp               = | ≠ | < | > | ≤ | ≥
Expression               exp               var1 cmp var2 | var cmp const
Logical                  L                 ∧ | ∨
Constraint information   constraint        exp1 L exp2
API call                 functionCall      function, constraint
Trace                    functionCallSeq   functionCallSeq functionCall+
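The record structure of Table 1 can be mirrored by a small data model, which also makes the scope restriction concrete: a call is recorded only if it belongs to the frequent API set. This is a minimal sketch, not the authors' implementation. The class and function names (`Expr`, `FunctionCall`, `record`), the symbolic variable names (`ssl`, `vr`, `buf`, `len`, `n`), and the encoding of a successful verification as a zero return value are assumptions made for illustration, and the constraint is simplified to a conjunction of expressions.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Expr:
    """exp: var cmp var | var cmp const (Table 1)."""
    var: str
    cmp: str                      # one of "=", "!=", "<", ">", "<=", ">="
    operand: str                  # another symbolic variable or a constant
    operand_is_var: bool = False


@dataclass(frozen=True)
class FunctionCall:
    """functionCall: function, constraint (Table 1)."""
    name: str
    args: Tuple[str, ...]
    rets: Tuple[str, ...]
    constraint: Tuple[Expr, ...] = ()   # assumption: conjunction only

    def constraint_vars(self) -> Set[str]:
        """Symbolic variables that occur in the path constraint of this call."""
        found = set()
        for expr in self.constraint:
            found.add(expr.var)
            if expr.operand_is_var:
                found.add(expr.operand)
        return found


def record(trace: List[FunctionCall], call: FunctionCall,
           frequent_apis: Set[str]) -> None:
    """Append `call` to the trace only if it is one of the frequent APIs."""
    if call.name in frequent_apis:
        trace.append(call)


FREQUENT = {"mbedtls_ssl_get_verify_result", "mbedtls_ssl_read"}
TRACE: List[FunctionCall] = []
record(TRACE, FunctionCall("mbedtls_ssl_get_verify_result", ("ssl",), ("vr",)), FREQUENT)
record(TRACE, FunctionCall("printf", ("fmt",), ()), FREQUENT)   # not recorded
record(TRACE, FunctionCall("mbedtls_ssl_read", ("ssl", "buf", "len"), ("n",),
                           (Expr("vr", "=", "0"),)), FREQUENT)
```

Because the read call is reached only on the branch where `vr` equals zero, the return value of the verification call appears in its recorded constraint, which is the information the control dependence rule later looks for.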

recorded and reducing the number of instances of constraint solving, this approach greatly reduces the computing and storage overheads. It also reduces the scope of the analysis in the next stage.
Table 1 lists the definitions of the symbols recorded during symbolic execution. Here, "trace" represents the recorded symbolic path information, which includes the sequence of the APIs in the path, and each record of API information includes the name, parameter, return value, and path constraint information.

Multiple API call specification mining
To address the second problem, this section presents an API relationship graph that combines pairwise semantic relationships of APIs to obtain a multi-API call specification. We first define the types of pairwise semantic relationships of APIs. Then, using the symbolized path information previously obtained, the relationships are matched to pairs of APIs. Next, using these matches, the API relationship graph is constructed, and the multi-API call specification is obtained by searching for the connected subgraphs. Because the symbolic path information only includes APIs that appear in frequent API sequences and the number of possible relationships between a limited number of API pairs is relatively small, the computational overhead of the proposed multi-API call specification mining is greatly reduced.

1. Types of API pairwise semantic relationships and their matching

The analysis in Section Related Work demonstrates that the semantic relationship between API pairs usually includes control dependencies, data dependencies, and parameter sharing relationships. The control dependency relationship of an API pair is reflected by the fact that the return value of one API helps to determine whether to execute another API. For example, the mbed TLS APIs mbedtls_ssl_read/mbedtls_ssl_write should be called only after the mbedtls_ssl_get_verify_result call returns successfully, so their path constraint contains the return value of mbedtls_ssl_get_verify_result (shown in Fig. 2). A data dependency occurs when the return value or variables defined by one API are transferred through parameters to another API. For example, the variable defined by the mbed TLS API mbedtls_ssl_init is used as the parameter of mbedtls_ssl_handshake (shown in Fig. 2). Parameter sharing occurs when a pair of APIs have shared parameters. For example, the mbed TLS APIs mbedtls_ssl_close_notify and mbedtls_ssl_read have a common "&ssl" parameter (shown in Fig. 2).
The matching rules for determining whether an API pair $(f_x, f_y)$ has the above three relationships are shown in Table 2. Here, "$\mathrm{arg}_x(i)$ is $\mathrm{arg}_y(j)$" means that parameter $\mathrm{arg}_x(i)$ in function $f_x$ is the same variable as parameter $\mathrm{arg}_y(j)$ in function $f_y$, and $\mathrm{constraint}(f_y)$ denotes the symbolic variables contained in the path constraint information of function $f_y$.
When mining the relationship between API pairs, we use a corresponding relationship support matrix for tracking the support of the relationship for each API pair

Fig. 9 Relationship support matrix

Table 2 API relationship matching rules

Relation type        Matching rule
Control dependence   $\mathrm{ControlRelations}(seq) = \{(f_x, f_y) \mid \exists i,\ \mathrm{ret}_x(i) \in \mathrm{constraint}(f_y)\}$
Data dependence      $\mathrm{RetRelations}(seq) = \{(f_x, f_y) \mid \exists i, j,\ \mathrm{ret}_x(i) \text{ is } \mathrm{arg}_y(j)\}$
Parameter sharing    $\mathrm{ArgRelations}(seq) = \{(f_x, f_y) \mid \exists i, j,\ \mathrm{arg}_x(i) \text{ is } \mathrm{arg}_y(j)\}$
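The three rules of Table 2, together with a per-relation support count, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: each recorded call is a plain dictionary with hypothetical keys `name`, `args`, `rets`, and `cvars` (the symbolic variables in its path constraint); pairs are taken in path order; each pair is counted once per trace; and a relation is kept when its support reaches the threshold, which is one reading of the support test. The traces are hand-written, and treating `ssl` as a variable defined by mbedtls_ssl_init follows the data dependence example in the text.

```python
from collections import Counter
from itertools import combinations

INIT = "mbedtls_ssl_init"
VERIFY = "mbedtls_ssl_get_verify_result"
READ = "mbedtls_ssl_read"


def match_pairs(trace):
    """Yield (relation, fx, fy) for each ordered API pair of one recorded trace."""
    for x, y in combinations(trace, 2):            # x precedes y on the path
        if set(x["rets"]) & set(y["cvars"]):       # ret_x(i) in constraint(f_y)
            yield "C", x["name"], y["name"]
        if set(x["rets"]) & set(y["args"]):        # ret_x(i) is arg_y(j)
            yield "D", x["name"], y["name"]
        if set(x["args"]) & set(y["args"]):        # arg_x(i) is arg_y(j)
            yield "S", x["name"], y["name"]


def mine_relations(traces, threshold=2):
    """Count support per relation type and keep pairs that reach the threshold."""
    support = {"C": Counter(), "D": Counter(), "S": Counter()}
    for trace in traces:
        for rel, fx, fy in set(match_pairs(trace)):   # once per trace
            support[rel][(fx, fy)] += 1
    return {rel: sorted(pair for pair, n in matrix.items() if n >= threshold)
            for rel, matrix in support.items()}


# Hand-written traces with hypothetical symbolic variable names.
T = [
    {"name": INIT, "args": ("ssl",), "rets": ("ssl",), "cvars": ()},
    {"name": VERIFY, "args": ("ssl",), "rets": ("vr",), "cvars": ()},
    {"name": READ, "args": ("ssl", "buf"), "rets": ("n",), "cvars": ("vr",)},
]
TRACES = [T, T, T[:2]]
RELATIONS = mine_relations(TRACES, threshold=2)
```

The three `Counter` objects play the role of the three support matrices: only the entries for pairs that actually co-occur are ever touched, so the cost grows with the number of recorded frequent API calls rather than with all APIs in the program.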

Table 3 Relationship type tags and corresponding dependencies or shared variable set representations

Relation type        L    τ
Control dependence   C    NULL
Data dependence      D    $RET_x \cap ARG_y$
Parameter sharing    S    $ARG_x \cap ARG_y$

$RET_x$ represents the set of return variables of $f_x$ together with the parameter variables passed to $f_x$ that are also defined in $f_x$. $ARG_x$ and $ARG_y$ represent the parameter variable sets of $f_x$ and $f_y$, respectively.

in the frequent API sequences for each relationship. The elements of this matrix are initialized to 0, as shown in Fig. 9, where $RM_{ControlRelation}$, $RM_{RetRelation}$, and $RM_{ArgRelation}$ respectively represent the support matrices of control dependence, data dependence, and parameter sharing relationships.
Then, for any function pair $(f_x, f_y)$ in each recorded symbolic path, the different relationships are matched according to the rules of Table 2. That is, if the symbolic variable of a return value of $f_x$ is included in the symbolic variable set related to the path constraint of $f_y$, there is a control dependency relationship between $f_x$ and $f_y$. If the symbolic variable of a return value of $f_x$ and the symbolic variable of a parameter in the parameter list of $f_y$ are the same, then there is a data dependency relationship between $f_x$ and $f_y$. If the symbolic variable of a parameter of $f_x$ is the same as the symbolic variable of a parameter of $f_y$, then there is a shared parameter relationship between $f_x$ and $f_y$. If there is a relationship between $f_x$ and $f_y$, the corresponding element of $(f_x, f_y)$ in the relationship support matrix is increased by one. Finally, if the corresponding element value of the function pair in a relationship support matrix is greater than a given threshold, the quad $(f_x, f_y, L, \tau)$ is added to the corresponding relationship list, where $L$ represents the relationship type and $\tau$ represents the dependent or shared variable set. The representations of $L$ and $\tau$ for different relationship types are shown in Table 3.

2. Multi-API semantic relationship aggregation based on the API relationship graph

The paired API relationships are combined to obtain the usage specifications of multiple APIs. First, the frequent API sequences and the paired relationship list are used to construct an API relationship graph. Then, the algorithm for finding the maximal connected subgraphs of a non-connected graph proposed by Tarjan (Karp and Tarjan 1980) is used to find the maximal connected subgraphs in the API relationship graph. Finally, connected subgraphs containing at least two nodes are retained and the specifications for multi-API relationships are constructed according to the subgraphs.
For example, using the set of frequent API sequences {BN_CTX_new, BN_CTX_free, BN_CTX_start, BN_CTX_end, BN_CTX_get, BN_new, BN_free, BN_copy} obtained by analyzing the OpenSSL source code, we can construct the API relationship graph shown in Fig. 10. The nodes in the figure represent the APIs in the frequent API sequences. The three relationship lists are traversed, and an undirected edge is added between the nodes

Fig. 10 API-pair relationship graph of a frequent API sequence in OpenSSL source code. Nodes: BN_CTX_new, BN_CTX_start, BN_CTX_get, BN_CTX_end, BN_CTX_free, BN_new, BN_free, BN_copy. Edge labels visible in the figure: C,!=0; C,=0; D,(BIGNUM *a); S,(BN_CTX *ctx).

Specification 1
APISequence:{ BN_CTX_new, BN_CTX_free, BN_CTX_start, BN_CTX_end, BN_CTX_get}
RelationSequence:
{ [BN_CTX_new,BN_CTX_free,<C,!=0>],
[BN_CTX_new,BN_CTX_start,<C,!=0>],
[BN_CTX_new,BN_CTX_end,<C,!=0>],
[BN_CTX_new,BN_CTX_get,<C,!=0>],
[BN_CTX_new,BN_CTX_free,<D,(BN_CTX *ctx)>],
[BN_CTX_new,BN_CTX_start,<D,(BN_CTX *ctx)>],
[BN_CTX_new,BN_CTX_end,<D,(BN_CTX *ctx)>],
[BN_CTX_new,BN_CTX_start,<D,(BN_CTX *ctx)>],
[BN_CTX_start,BN_CTX_end, BN_CTX_get, BN_CTX_free,<S,(BN_CTX *ctx)>],
[BN_CTX_new,BN_CTX_start,<M,NULL>],
[BN_CTX_start,BN_CTX_get,<M,NULL>],
[BN_CTX_get,BN_CTX_end,<M,NULL>],
[BN_CTX_free,BN_CTX_end,<P,NULL>],
[BN_CTX_free,BN_CTX_start,<P,NULL>],
[BN_CTX_end,BN_CTX_get,<P,NULL>],
}
Specification 2
APISequence:{BN_new,BN_free}
RelationSequence:
{[BN_new,BN_free,<C,!=0>],
[BN_new,BN_free,<D,(BIGNUM *a)>],
[BN_new,BN_free,<M,NULL>]
}
Fig. 11 Specifications of a frequent API sequence in OpenSSL source code
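The aggregation that turns pairwise relationships into the two specifications of Fig. 11 amounts to finding connected components in an undirected multigraph. The sketch below uses a plain depth-first search rather than the cited algorithm, so it should be read as an illustration of the idea only. The edge list is a subset of the C, D, and S entries transcribed from Fig. 11; splitting the four-API S entry into pairs is an assumption made here.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected multigraph."""
    adjacency = {n: set() for n in nodes}
    for a, b, _label in edges:            # parallel edges collapse in the set
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                      # iterative depth-first search
            current = stack.pop()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


NODES = ["BN_CTX_new", "BN_CTX_free", "BN_CTX_start", "BN_CTX_end",
         "BN_CTX_get", "BN_new", "BN_free", "BN_copy"]
EDGES = [
    ("BN_CTX_new", "BN_CTX_free", "C"),
    ("BN_CTX_new", "BN_CTX_start", "C"),
    ("BN_CTX_new", "BN_CTX_end", "C"),
    ("BN_CTX_new", "BN_CTX_get", "C"),
    ("BN_CTX_new", "BN_CTX_free", "D"),
    ("BN_CTX_start", "BN_CTX_end", "S"),
    ("BN_CTX_start", "BN_CTX_get", "S"),
    ("BN_new", "BN_free", "C"),
    ("BN_new", "BN_free", "D"),
]

COMPONENTS = connected_components(NODES, EDGES)
# Keep only subgraphs with at least two nodes; single nodes are discarded.
SPECS = [c for c in COMPONENTS if len(c) >= 2]
```

Each retained component supplies the APISequence of one specification, and the labelled edges inside it supply the RelationSequence.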

corresponding to the APIs for which a pair relationship exists. The labels of the edges represent different relationship types and dependent or shared variable sets. An API pair can have more than one semantic relationship. For instance, BN_CTX_new and BN_CTX_free have both control dependence and data dependence relationships. Hence, a node pair in the API relationship graph can have more than one edge. Specifically, an API pair can have both control dependence and data dependence, or both control dependence and a parameter sharing relationship. Three connected subgraphs can be found in the API relationship graph. The connected subgraph with the single node BN_copy is discarded. For each of the remaining two connected subgraphs, the API set and the API-pair relationship set in the subgraph are extracted to obtain the two specifications shown in Fig. 11.

Implementation and evaluation
Evaluation setup
Using the Clang static analyzer (from the LLVM framework), a specification mining tool called specificsan was implemented. The Clang static analyzer’s symbolic execution engine performs context- and path-sensitive interprocedural flow analysis for C/C++ code. The graph reachability engine and the symbolic execution engine, which constitute the infrastructure for the data flow analysis, were used for analysis (Shastry et al. 2016).
In the first phase of the implementation, the API sequence extractor was implemented using the graph reachability engine, which tracks the call-return semantics (context sensitivity) of the procedure call and extracts the inter-procedure API sequence.
API sequences does it solve the path constraint and collect symbolic path and constraint information. In the analysis, the loop is unrolled only once, so that the path condition of each API obtained by the local analysis can be saved on one node and the analysis result can be reused in different contexts, thereby further improving the analysis efficiency. When matching the three types of semantic relationships between API pairs, the scope of the analysis is relatively small because the frequent API sequences have been extracted. To avoid missing relationships between APIs as much as possible and to exclude the occasional relationship that has a single instance, when the three types of semantic relationships between the API pairs are evaluated using the

Table 4 Test target open-source project information


Project name Amount of code (KB) Client code Client code example

libTIFF-4.0.10 2649 8 debian packages using libTIFF Ghostscript, libfox, libwraster, vagrant, etc
Openssh-7.9 3717 16 debian packages using Openssh Vagrant,sshuttle,ssh-krb5,etc
mbedTLS-2.16.0 4206 Microchip’s open-source test case code Microchip’s open-source test case code
OpenSSL-1.1.1 22,922 45 debian packages using Openssl Tinyca openvpn dsniff ssvn, etc
Putty-3.4 2386 5 open-source projects using putty PuttyRider, WebPutty, etc
FFmpeg-3.0.12 10,854 8 open-source projects using ffmpeg FFMpegCore, mobile-ffmpeg, etc
MuPDF-1.14.0 54,272 7 open-source projects using MuPDF Android-MuPDF, mppdf-qt, go-fiz, etc
php-7.0.22 19,148 5 open-source projects using php PHPOfficeSpreadsheet, etc
Pidgin-2.13.0 13,209 5 open-source projects using Pidgin Skyp4pidgin, pidgin-lwqq, etc
zlib-1.2.11 1464 8 open-source projects using zlib Zlib-ng, zlib-searcher, etc

relationship support matrix, the threshold of support is set to two. The specification mining tests were carried out on several types of open-source projects as detailed in Table 4. Further experiments were conducted on six of these projects: libTIFF-4.0.10, OpenSSH-7.9, mbed TLS-2.16.0, OpenSSL-1.1.1, Putty-0.7 and zlib-1.2.11. Of these, OpenSSL and mbed TLS are well-known projects that implement encryption and the SSL/TLS protocol. OpenSSH is a well-known project that implements encryption and the SSH protocol. libTIFF is a well-known open-source project that deals with the complex TIFF format. These projects are widely used, their API interfaces are rich, and the correctness of their API calls is important for security. Because of the low frequency of calls to the main function APIs in mbed TLS, Microchip’s open-source test case code (MicrochipTech 2019), which references mbed TLS, was added. For libTIFF-4.0.10, OpenSSH-7.9 and OpenSSL-1.1.1, client code of Debian packages using these libraries was used for specification mining.

Selection of the minimum support number of frequent items
During the mining process, we found that different settings of the minimum support of frequent items have a large impact on the mining results of frequent API sequences. This section compares the mining results of each open-source project with different values for the minimum support of frequent items, analyzes the changes in mining performance with different settings, and selects the appropriate minimum support number. Fig. 12 shows the total number of frequent API sequences extracted from OpenSSL and OpenSSH as well as the number of effective sequences confirmed.
In the review process, we directly used the semantic relationship extraction method in the second step

Fig. 12 Specification count and effective number for different minimum support values of the FPMAX algorithm for (a) OpenSSL and (b) OpenSSH

of our approach. The judgment criterion was that if the items in a sequence do not have any semantic relationships with each other, we consider the sequence to be ineffective.
The minimum support number was selected dynamically for each target project. As Fig. 12a shows, for OpenSSL, as the minimum support number decreases, the total number and effective number of mined frequent sequences increase. When the minimum support setting is reduced from 4 to 3, the total number increases further, but the effective number does not increase. Hence, for OpenSSL, the minimum support number was set to 4. As Fig. 12b shows, for OpenSSH, when the minimum support number decreases from 5 to 4, the effective number no longer increases, so the minimum support number was set to 5. Using a similar experimental analysis for libTIFF and mbed TLS, the minimum support numbers were chosen to be 3 and 4, respectively. It is worth mentioning that the minimum support number may differ between projects because of their specific features, the amount of code, etc. The range of minimum support numbers is limited. The experiments are conducted by estimation and prior analysis, and it is easy to find the optimal value.

Specification mining results
API frequent sequence mining results
Existing approaches such as PR-Miner and ml4spec only mine frequent API sequences, so this section compares the results of the PR-Miner approach with the results of the first phase of our approach. To test the effect of eliminating interference APIs on the mining results, the results of the first stage of the approach with and without filtering out the interference APIs are compared.
The effective sequences mined by the PR-Miner approach and those mined in the first stage of our approach are merged as the benchmark data of the frequent API sequences for each open-source project. Then, the recall and effective ratios of the different approaches are analyzed. The recall ratio
RR = TP/(TP + FN)
is the ratio of the correctly reported samples to all benchmark samples. The effective ratio
ER = TP/(TP + FP)
is the ratio of the correctly reported samples to the total number reported.
The results are summarized in Table 5. Compared with PR-Miner, the RR of the frequent API sequences in the first stage of our approach is much higher. This is because the PR-Miner approach extracts the API sequences intra-procedurally, and it may miss useful APIs in the sequence. Moreover, eliminating interference APIs has no effect on the RR of the first stage of the approach, but the ER of the first stage is significantly improved. This is because there are many interference APIs in these test objects (Table 6 shows the frequency of some interference APIs), and these interference APIs generate many invalid frequent sequences that contain them. By filtering

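Table 5 Comparison of PR-Miner, ml4spec and the proposed method (first stage) results. FCA: frequent common APIs

| Project | PR-Miner RR(%) | PR-Miner ER(%) | ml4spec RR(%) | ml4spec ER(%) | FCA count | Without FCA removal RR(%) | Without FCA removal ER(%) | With FCA removal RR(%) | With FCA removal ER(%) |
|---|---|---|---|---|---|---|---|---|---|
| libTIFF-4.0.10 | 60 | 55 | 66 | 84 | 7 | 74 | 63 | 74 | 85 |
| OpenSSL-1.1.1 | 53 | 54 | 55 | 80 | 17 | 68 | 58 | 68 | 81 |
| OpenSSH-7.9p1 | 56 | 64 | 62 | 78 | 11 | 72 | 74 | 72 | 82 |
| Mbed TLS-2.16.0 | 57 | 63 | 61 | 74 | 15 | 73 | 63 | 73 | 77 |
| Putty-0.7 | 58 | 62 | 63 | 77 | 13 | 70 | 62 | 72 | 78 |
| zlib-1.2.11 | 60 | 64 | 60 | 76 | 10 | 66 | 70 | 74 | 80 |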
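
As a minimal sketch of how RR and ER values such as those in Table 5 can be computed, the following Python snippet counts TP, FN and FP between a set of reported sequences and a benchmark set. The function name, the placeholder API names (api_a, api_b, api_c, api_d, log_error) and the sample data are assumptions made for illustration only; they are not the paper's data or tooling.

```python
def recall_and_effective_ratio(reported, benchmark):
    """Return (RR, ER) for reported sequences measured against a benchmark set."""
    reported, benchmark = set(reported), set(benchmark)
    tp = len(reported & benchmark)   # correctly reported samples
    fn = len(benchmark - reported)   # benchmark samples that were missed
    fp = len(reported - benchmark)   # reported samples not in the benchmark
    rr = tp / (tp + fn) if tp + fn else 0.0
    er = tp / (tp + fp) if tp + fp else 0.0
    return rr, er


# Hypothetical benchmark: each sequence is an unordered set of API names.
benchmark = {
    frozenset({"SHA512_Init", "SHA512_Update", "SHA512_Final", "Openssl_cleanse"}),
    frozenset({"MD5Init", "MD5Update", "MD5Final", "smemclr"}),
    frozenset({"api_a", "api_b"}),
    frozenset({"api_c", "api_d"}),
}

# Hypothetical miner output: two correct sequences and one invalid sequence.
reported = {
    frozenset({"SHA512_Init", "SHA512_Update", "SHA512_Final", "Openssl_cleanse"}),
    frozenset({"MD5Init", "MD5Update", "MD5Final", "smemclr"}),
    frozenset({"api_a", "log_error"}),
}

rr, er = recall_and_effective_ratio(reported, benchmark)
print(f"RR = {rr:.2f}, ER = {er:.2f}")  # RR = 0.50, ER = 0.67
```

Here two of the four benchmark sequences are found (RR = 2/4), and two of the three reported sequences are correct (ER = 2/3).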

Table 6 Number of frequent common APIs found in the experimental projects. FCA name: name of frequent common API, FR: frequency

| Mbed TLS FCA name | FR | OpenSSL FCA name | FR | Libtiff FCA name | FR | OpenSSH FCA name | FR | Putty FCA name | FR | Zlib FCA name | FR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Test_fail | 302 | ERR_PUT_error | 3,425 | TIFFErrorExt | 591 | ssh_err | 750 | safefree | 384 | gz_error | 12 |
| Mbedtls_platform_zeroize | 16 | BIO_printf | 2,084 | TIFFError | 488 | strerror | 607 | safemalloc | 260 | free | 12 |
| Mbedtls_debug_print_msg | 55 | ERR_print_errors | 472 | _TIFFfree | 440 | logit | 273 | saferealloc | 91 | _tr_flush_bits | 7 |
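
The effect of removing frequent common APIs before mining can be illustrated with a short Python sketch. This section does not spell out how frequent common APIs are identified, so the rule used here (drop any API that appears in more than a fixed fraction of the call sequences) and the `max_ratio` parameter are assumptions, and the four call sequences are made up for illustration.

```python
from collections import Counter


def filter_frequent_common_apis(transactions, max_ratio=0.5):
    """Drop APIs that occur in more than max_ratio of all call sequences.

    The threshold rule is an illustrative assumption, not the paper's method.
    """
    counts = Counter(api for seq in transactions for api in set(seq))
    limit = max_ratio * len(transactions)
    common = {api for api, n in counts.items() if n > limit}
    filtered = [[api for api in seq if api not in common] for seq in transactions]
    return common, filtered


# Hypothetical per-function call sequences from a libTIFF client.
transactions = [
    ["TIFFOpen", "TIFFGetField", "TIFFErrorExt", "TIFFClose"],
    ["TIFFOpen", "TIFFReadScanline", "TIFFErrorExt", "TIFFClose"],
    ["_TIFFmalloc", "TIFFErrorExt", "_TIFFfree"],
    ["TIFFWriteScanline", "TIFFErrorExt"],
]

common, filtered = filter_frequent_common_apis(transactions)
print(common)       # {'TIFFErrorExt'}
print(filtered[0])  # ['TIFFOpen', 'TIFFGetField', 'TIFFClose']
```

Because the error-reporting call occurs in every sequence, it would otherwise appear in almost every frequent itemset, which is the kind of invalid sequence the ER column of Table 5 penalizes.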

them out, the invalid sequences are eliminated. At the same time, the elimination of interference APIs reduces the size of the data set and significantly reduces the processing time of the algorithm.
The RR of the proposed approach is significantly higher than that of the ml4spec approach. After analyzing the experimental data, it was found that the ml4spec approach uses a text similarity method to cluster the APIs and filters them based on the clustering result, so some textually dissimilar APIs are omitted. For example, for the valid OpenSSL sequence {SHA512_Init, SHA512_Update, SHA512_Final, Openssl_cleanse}, the sequence mined by the ml4spec method contains only the first three functions. In terms of ER, the ml4spec method is superior to the first stage of the proposed approach without removing interference APIs and is close to the case with interference APIs removed. This is because the clustering method based on textual similarity can eliminate some of the interference APIs, but still introduces some frequent common APIs whose names are similar to those of functional APIs, such as TIFFErrorExt, mbedtls_debug_print_msg, etc.

API relationship mining experiment results
The goal of our approach is to mine a specification that contains the semantic relationships among multiple APIs. The current approaches can only efficiently mine the semantic relationship between API pairs. APISan is a typical approach proposed by Yun et al. (Yun et al. 2016). In this section, we compare the performance of the proposed approach with that of the APISan approach at extracting relationships of API pairs.
Among the three semantic relationships of API pairs mined in this paper, APISan only mines control dependencies and does not analyze parameter sharing and data dependency relationships. The 45 Debian packages using OpenSSL shown in Table 4 were employed as test data, because of the absence of a public benchmark in API relationship mining. The benchmark data is set to the set of API relationships extracted by the two approaches. If the minimum support degree of the API relationship set in the APISan approach is greater than or equal to 5, many valid relationships will be missed, whereas when it is equal to 1, many invalid accidental relationships will be reported. Therefore, in the experiment, the APISan approach's API relationship minimum support was set to 2, 3 and 4 and then compared with the RR and ER of the extraction results of our approach. The results are shown in Table 7. This table shows that the efficiency of our approach is significantly higher than that of the APISan approach, mainly because the results extracted by the APISan approach contain many invalid relationships related to frequent common APIs. To analyze the impact of frequent common APIs on relationship mining, we compared the results of under-constrained symbolic execution with and without removing them from the sequences mined in the first stage. As shown in Table 7, with the removal of frequent common APIs, the RR does not change but the ER improves significantly. The memory usage and the processing time are reduced accordingly. This shows that removing frequent common APIs is effective in reducing the redundant API relationships in the result.
Moreover, the RR of the proposed approach is significantly higher than that of the APISan approach. This is because APISan only mines control dependencies and does not analyze parameter sharing and data dependency relationships. In addition, we found that of the total 38 rules, 11 rules were uniquely mined by the proposed approach, as compared with the APISan approach.
In terms of memory usage and time overhead, the proposed approach is clearly superior to the APISan approach. This is mainly because the proposed approach

Table 7 Experimental results of this approach and APISan approach

| Indicators | APISan th = 2 | APISan th = 3 | APISan th = 4 | Without FCA removal | With FCA removal |
|---|---|---|---|---|---|
| RR(%) | 71 | 61 | 58 | 95 | 95 |
| ER(%) | 68 | 78 | 85 | 82 | 91 |
| Maximum memory usage (MB) | 2544 | 2437 | 2305 | 424 | 251 |
| Time (s) | 542 | 445 | 356 | 322 | 296 |

45 debian packages using Openssl were employed as test data. th: threshold, RR: recall rate, and ER: effective rate

Table 8 Specification mining results

| Program | API sequences | Mined specifications | Correct specifications |
|---|---|---|---|
| libTIFF | 1,534 | 43 | 21 |
| OpenSSH | 3,516 | 98 | 54 |
| mbed TLS | 705 | 68 | 29 |
| OpenSSL | 13,406 | 113 | 72 |
| Putty | 1,905 | 89 | 53 |
| zlib | 310 | 27 | 15 |

API sequences: the number of API sequences extracted in the first stage, where each path contains at least two or more APIs, support: the minimum support for frequent items selected for different open-source project

only performs relationship extraction analysis on the APIs in the frequent sequences mined from the first stage, which filters out frequent common APIs and other infrequent APIs. The APISan approach extracts and analyzes the relationships among all APIs. The search space of possible API-pair relationships is substantially larger, so the time and memory overhead are large. The APISan approach can only analyze part of the code because of the large memory overhead. The lower memory and time overhead of our approach enables it to analyze the entire code of the target project.
In addition, we found that the number of rules extracted by the proposed approach is much lower than that of APISan. This is because the rules extracted by the proposed approach contain the aggregation of the relationships of multiple APIs, while the rules extracted by APISan contain the pre/post conditions and return value dependencies between each API pair. There is no aggregation of multiple API semantic relationships, which greatly increases the number of rules and the cost of subsequent vulnerability analysis. The final experiment combines the pairs of relationships and explores the usage specifications of multiple APIs. Table 8 shows the specification mining results for six open-source projects. The number of sequences extracted, the mining runtime, and the size of the code increase proportionally. The number of specifications is also consistent with the size of the code except that the number of specifications of mbed TLS-2.16.0 is less than that of OpenSSH-7.9 and that of mbed TLS-1.14.0 is less than that of OpenSSL-1.1.1. This is because the protocols implemented by OpenSSH-7.9 and OpenSSL-1.1.1 are more complex than those of the others.

Analysis of API call specifications violations
Counterexamples are generated after the frequent API sets and the API call specifications are mined. During our experiments on OpenSSL, Putty, GnuTLS and mbed TLS, we further analyzed the counterexamples and found that violations of API call specifications, such as missing calls, missing checks, and ignored return values, are likely to cause security threats. API call models constructed from specifications help automatic tools (CGF) to mine vulnerabilities. This paper selects two typical security threats caused by violations of API call specifications, namely information leakage and access check bypass, and selects vulnerability examples to analyze and illustrate our ideas.

Information leakage
Figure 13 shows the API call specifications of hash operations in Putty-0.70 and OpenSSL-1.1.1. When the hash function is called, the clean function needs to be called to clear the memory of the hash variable.

[Figure: two call chains whose edges are labeled SharedArgs and PathCondition. (a) MD5Init → MD5Update → MD5Final → smemclr. (b) SHA512_Init → SHA512_Update → SHA512_Final → Openssl_cleanse.]
Fig. 13 API call specifications of hash operations in Putty and OpenSSL. a API call specification of hash operation in Putty. b API call specification of hash operation in OpenSSL

We reported our findings to the OpenSSL development team, and they were officially acknowledged. The issue has been fixed in later versions.

Access control bypass
Unchecked GnuTLS certificate
GnuTLS is an OpenSSL-like implementation of the SSL protocol used in several projects such as pidgin, scrollz, and mod_gnutls. The SSL operation specification is shown in Fig. 14. The specification describes how an SSL connection determines whether the peer certificate is valid. If the APIs are not invoked correctly, the certificate validation function will fail, and there is a risk of a "man-in-the-middle attack" using a forged certificate.
In the mod_gnutls module, only in the current proxy mode is gnutls_certificate_set_verify_function called for access verification after gnutls_init; nothing is called for certificate validation on any other path. Such an API calling sequence leads to attacks in the form of forged certificates.

Unchecked return value in mbed TLS
The program for the embedded system references the mbed TLS (formerly known as PolarSSL) library, an implementation of the SSL protocol for embedded systems. The SSL operation specification is shown in Fig. 15.
We found two violations of this API call specification. One is in the dtls server module: the missing call of mbedtls_ssl_get_verify_result to verify the client certificate after calling mbedtls_ssl_handshake leads to potential SSL man-in-the-middle attacks. The other is in the cert_app module, where the missing check of the return value after calling mbedtls_ssl_close_notify causes a unilateral close of the connection without consultation and affects the usability of the program.
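
The two kinds of violation described for mbed TLS (a missing call and an unchecked return value) can be illustrated with a small Python checker. This is only a sketch under stated assumptions: the specification is reduced to the ordered API list of Fig. 15, the set `MUST_CHECK_RETURN` and the two call traces are hypothetical simplifications, and the code is not the paper's counterexample generator.

```python
# Ordered API list taken from the mbed TLS specification in Fig. 15.
SPEC = [
    "mbedtls_ssl_init",
    "mbedtls_ssl_setup",
    "mbedtls_ssl_set_hostname",
    "mbedtls_ssl_handshake",
    "mbedtls_ssl_get_verify_result",
    "mbedtls_ssl_close_notify",
    "mbedtls_ssl_free",
]

# Assumption for illustration: APIs whose return value the caller must test.
MUST_CHECK_RETURN = {"mbedtls_ssl_handshake", "mbedtls_ssl_close_notify"}


def find_violations(trace):
    """trace: list of (api_name, return_value_checked) pairs in call order."""
    called = {api for api, _ in trace}
    violations = [("missing call", api) for api in SPEC if api not in called]
    violations += [
        ("unchecked return value", api)
        for api, checked in trace
        if api in MUST_CHECK_RETURN and not checked
    ]
    return violations


# Hypothetical trace that never verifies the peer certificate.
trace_missing_call = [
    (api, True) for api in SPEC if api != "mbedtls_ssl_get_verify_result"
]

# Hypothetical trace that ignores the result of mbedtls_ssl_close_notify.
trace_unchecked = [(api, api != "mbedtls_ssl_close_notify") for api in SPEC]

print(find_violations(trace_missing_call))
# [('missing call', 'mbedtls_ssl_get_verify_result')]
print(find_violations(trace_unchecked))
# [('unchecked return value', 'mbedtls_ssl_close_notify')]
```

A real checker would also have to respect the path conditions and shared arguments recorded on the specification edges; the sketch only tests call presence and return-value handling.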

[Figure: API call graph with the nodes gnutls_init, gnutls_certificate_allocate_credentials, gnutls_credentials_set, gnutls_certificate_set_verify_function, gnutls_handshake, gnutls_certificate_free_credentials, gnutls_bye and gnutls_deinit(session); edges are labeled DataDependency, PathCondition and SharedArgs.]
Fig. 14 API call specifications of SSL related in Gnutls
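
Specification graphs like those in Figs. 13 to 15 combine pairwise API relationships into multi-API specifications. The following Python sketch shows one simple way to do such an aggregation, by grouping pairs into connected components. It is an illustrative stand-in rather than the paper's graph-based aggregation algorithm, and the edge labels assigned to each pair are read loosely from Fig. 13 and should be treated as assumptions.

```python
from collections import defaultdict


def aggregate_pairs(pair_relations):
    """Group (src, dst, labels) pair relations into connected specifications."""
    edges = {}                     # (src, dst) -> set of relationship labels
    neighbours = defaultdict(set)  # undirected adjacency, used only for grouping
    for src, dst, labels in pair_relations:
        edges.setdefault((src, dst), set()).update(labels)
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    specs, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search over one connected component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        spec_edges = {p: l for p, l in edges.items() if p[0] in component}
        specs.append((component, spec_edges))
    return specs


# Pair relations in the style of Fig. 13; the labels are illustrative.
pairs = [
    ("MD5Init", "MD5Update", {"SharedArgs", "PathCondition"}),
    ("MD5Update", "MD5Final", {"SharedArgs", "PathCondition"}),
    ("MD5Final", "smemclr", {"SharedArgs", "PathCondition"}),
    ("SHA512_Init", "SHA512_Update", {"SharedArgs", "PathCondition"}),
    ("SHA512_Update", "SHA512_Final", {"SharedArgs"}),
    ("SHA512_Final", "Openssl_cleanse", {"SharedArgs"}),
]

specs = aggregate_pairs(pairs)
for apis, spec_edges in specs:
    print(sorted(apis), len(spec_edges), "edges")
```

Six pairwise rules collapse into two specifications of four APIs each, which matches the observation that aggregated specifications are far fewer than per-pair rules.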

[Figure: call chain mbedtls_ssl_init → mbedtls_ssl_setup → mbedtls_ssl_set_hostname → mbedtls_ssl_handshake → mbedtls_ssl_get_verify_result → mbedtls_ssl_close_notify → mbedtls_ssl_free; the first edge is labeled DataDependency and the remaining edges are labeled PathCondition and SharedArgs.]
Fig. 15 API call specifications of SSL related in mbed TLS

The code of the BN_generate_dsa_nonce function in OpenSSL-1.1.1 that calls hash-related APIs is shown in Fig. 4. The code of the pageant_handle_msg function in Putty that calls hash-related APIs is shown in Fig. 16. The code with the unchecked return value of the mbedtls_ssl_close_notify function in the cert_app module of mbed TLS is shown in Fig. 17.

Conclusion
This paper proposed an API specification mining approach that efficiently extracts a relatively complete list of the API combinations and semantic relationships between APIs. The approach mines the target code in two stages. The first stage uses the improved maximum frequent item-set mining algorithm after frequent common API identification and filtration to obtain accurate frequent API sequences. Using the results of the first stage, the second stage employs a semantic-relationship-sensitive automatic API specification mining method, based on domain-adapted under-constrained symbolic execution and graph-based relationship aggregation, to mine flow-, path-, and context-sensitive multiple API call specifications. The experimental results show that the proposed frequent itemset mining algorithm is superior to the classical PR-Miner approach in terms of efficiency and recall rate. For the final API call specification, not only is the performance of the proposed API-pair relationship mining better than that of the existing typical approach of APISan, but it can also mine

putty-0.70\pageant.c
void *pageant_handle_msg(const void *msg, int msglen, int *outlen,
                         void *logctx, pageant_logfn_t logfn)
{
    ……
    switch (type) {
      ……
      case SSH1_AGENTC_RSA_CHALLENGE:
        ……
        MD5Init(&md5c);
        MD5Update(&md5c, response_source, 48);
        MD5Final(response_md5, &md5c);
        smemclr(response_source, 48); /* burn the evidence */
        freebn(response); /* and that evidence */
        freebn(challenge); /* and that evidence */
        break;…
    }
    ……
    return ret;
}
Fig. 16 The code of the pageant_handle_msg function in Putty that calls hash-related APIs

    mbedtls_printf( "%s\n", buf );
    ……
    mbedtls_ssl_close_notify( &ssl );
ssl_exit:
    mbedtls_ssl_free( &ssl );
    mbedtls_ssl_config_free( &conf );
Fig. 17 Code of unchecked return value of mbedtls_ssl_close_notify function in cert_app module in mbed TLS

multiple API call specifications. Moreover, the mining efficiency was also shown to be significantly improved.

Acknowledgements
We thank the anonymous reviewers for their helpful remarks. We thank the editor and the reviewers for their useful feedback that improved this paper.

Author contributions
Zhongxu Yin: Conceptualization of this study, Methodology, Validation. Yiran Song: Formal analysis, Data Curation. Guoxiao Zong: Investigation, Data Curation, Writing - Original draft preparation, Visualization.

Funding
No funding.

Availability of data and materials
All data generated or analyzed during this study are included in this published article.

Declarations
Competing interests
All authors disclosed no relevant relationships.

Received: 3 July 2023 Accepted: 20 February 2024

References
Bian P et al (2018a) Detecting bugs by discovering expectations and their violations. IEEE Trans Softw Eng 45(10):984–1001
Bian P et al (2018) Nar-miner: Discovering negative association rules from code for bug detection. In: Proceedings of the 2018 26th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. pp 411–422
Chang R-Y, Podgurski A (2012) Discovering programming rules and violations by mining interprocedural dependences. J Softw: Evolut Process 24(1):51–66
Chang R-Y, Podgurski A, Yang J (2008) Discovering neglected conditions in software by mining dependence graphs. IEEE Trans Softw Eng 34(5):579–596
Chen L et al (2018) Automatic mining of security-sensitive functions from source code. Comput, Mater Continua. https://doi.org/10.3970/cmc.2018.02574
Dyer R et al (2013) Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In: 2013 35th international conference on software engineering (ICSE). IEEE, pp 422–431

Grahne G, Zhu J (2003) Efficiently using prefix-trees in mining frequent itemsets. In: FIMI. Vol. 90, pp 65
Grahne G, Zhu J (2003) High performance mining of maximal frequent itemsets. In: 6th International workshop on high performance data mining. Vol. 16, pp 34
He B et al (2015) Vetting SSL Usage in Applications with SSLINT. In: 2015 IEEE Symposium on Security and Privacy. pp 519–534. https://doi.org/10.1109/SP.2015.38
Henkel J et al (2019) Enabling Open-World Specification Mining via Unsupervised Learning. arXiv preprint arXiv:1904.12098
Huan J et al (2004) Spin: mining maximal frequent subgraphs from graph databases. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. pp 581–586
Jana S, Kang YJ, Roth S et al (2016) Automatically detecting error handling bugs using error specifications. In: 25th USENIX Security Symposium (USENIX Security 16). pp 345–362
Kang Y, Ray B, Jana S (2016) Apex: Automated inference of error specifications for c apis. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering. pp 472–482
Karp RM, Tarjan RE (1980) Linear expected-time algorithms for connectivity problems. In: Proceedings of the twelfth annual ACM symposium on Theory of computing. pp 368–377
Lee G et al (2016) Approximate maximal frequent pattern mining with weight conditions and error tolerance. International Journal of Pattern Recognition and Artificial Intelligence 30(06):1650012
Lee G, Yun U (2018) Performance and characteristic analysis of maximal frequent pattern mining methods using additional factors. Soft Comput 22:4267–4273
Lemieux C, Park D, Beschastnikh I (2015) General LTL specification mining (T). In: 2015 30th IEEE/ACM international conference on automated software engineering (ASE). IEEE, pp 81–92
Liang B et al (2016) AntMiner: mining more bugs by reducing noise interference. In: Proceedings of the 38th international conference on software engineering. pp 333–344
Li Z, Zhou Y (2005) PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code. ACM SIGSOFT Softw Eng Notes 30(5):306–315
Lv T, Li R, Yang Y et al (2020) Rtfm! automatic assumption discovery and verification derivation from library document for api misuse detection. In: Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. pp 1837–1852
MicrochipTech (2019) MicrochipTech mbedtls examples. https://github.com/MicrochipTech/mbedtls-examples
Nguyen HA et al (2014) Mining preconditions of APIs in large-scale code corpus. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. pp 166–177
Nguyen HA et al (2015) Consensus-based mining of API preconditions in big code. In: Companion Proceedings of the 2015 ACM SIGPLAN international conference on systems, programming, languages and applications: software for humanity. pp 5–6
Ramanathan MK, Grama A, Jagannathan S (2007) Static specification inference using predicate mining. ACM SIGPLAN Notices 42(6):123–134
Ramos DA, Engler D (2015) Under-constrained symbolic execution: Correctness checking for real code. In: 24th USENIX Security Symposium (USENIX Security 15). pp 49–64
Schlichtig M, Sassalla S, Narasimhan K et al (2022) Fum-a framework for api usage constraint and misuse classification. In: 2022 IEEE international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 673–684
Shastry B et al (2016) Towards vulnerability discovery using staged program analysis. In: Detection of intrusions and malware, and vulnerability assessment: 13th international conference, DIMVA 2016, San Sebastián, Spain, July 7–8, Proceedings 13. Springer, pp 78–97
Tamaskar SD, Raut AB. Approach for Mining in Lossless Representation of Closed Itemsets. 2016(11)
Wang X, Zhao L (2023) APICAD: Augmenting API Misuse Detection through Specifications from Code and Documents. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, pp 245–256
Yamaguchi F, Wressnegger C, Gascon H et al (2013) Chucky: Exposing missing checks in source code for vulnerability discovery. In: Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. pp 499–510
Yin Z et al (2020) A security sensitive function mining approach based on precondition pattern analysis. Comput, Mater Continua 63(2):1013–1029
Yun I et al (2016) APISan: Sanitizing API Usages through Semantic Cross-Checking. In: Usenix Security Symposium. pp 363–378
Yun U, Lee G (2016) Incremental mining of weighted maximal frequent itemsets from dynamic databases. Expert Syst Appl 54:304–327
Yun U, Lee G, Lee K-M (2016) Efficient representative pattern mining based on weight and maximality conditions. Expert Syst 33(5):439–462

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
