
Expert Systems With Applications 168 (2021) 114348

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

BinDeep: A deep learning approach to binary code similarity detection


Donghai Tian a,b, Xiaoqi Jia c, Rui Ma a,*, Shuke Liu a, Wenjing Liu a, Changzhen Hu a

a Beijing Key Laboratory of Software Security Engineering Technique, Beijing Institute of Technology, Beijing 100081, China
b Shanxi Military and Civilian Integration Software Engineering Technology Research Center, North University of China, Taiyuan 030051, China
c Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China

ARTICLE INFO

Keywords: Binary code; Deep learning; Similarity comparison; Siamese neural network; LSTM; CNN

ABSTRACT

Binary code similarity detection (BCSD) plays an important role in malware analysis and vulnerability discovery. Existing methods mainly rely on expert knowledge for BCSD, which may not be reliable in some cases. More importantly, the detection accuracy (or performance) of these methods is not satisfactory. To address these issues, we propose BinDeep, a deep learning approach for binary code similarity detection. This method first extracts the instruction sequence from the binary function and then uses an instruction embedding model to vectorize the instruction features. Next, BinDeep applies a Recurrent Neural Network (RNN) deep learning model to identify the specific types of the two functions for later comparison. According to the type information, BinDeep selects the corresponding deep learning model for similarity comparison. Specifically, BinDeep uses Siamese neural networks, which combine LSTM and CNN, to measure the similarity of two target functions. Different from traditional deep learning models, our hybrid model takes advantage of both CNN spatial structure learning and LSTM sequence learning. The evaluation shows that our approach can achieve good BCSD between cross-architecture, cross-compiler, cross-optimization, and cross-version binary code.

1. Introduction

With the rapid development of information technology, a large amount of software is widely used in personal computers and IoT devices with different CPU architectures. As a result, more and more people are paying attention to software security. Malicious developers may insert malicious code into their software. For instance, some vendors exploit malware techniques for commercial software promotion. Even if the software is developed by a trusted developer, it may contain one or more security vulnerabilities due to a programming mistake. To protect their intellectual property, most software developers will not provide the source code of their product for security analysis.

To address these issues, a number of binary code analysis methods have been proposed in recent years. In this paper, we focus on the method for binary code similarity detection (BCSD), which is useful for malware analysis and vulnerability discovery. By conducting the similarity comparison against known binary functions, we can identify the corresponding vulnerabilities (or malicious functions) in different binary code.

The same source code can be compiled with different compilers and different optimization levels, targeting different CPU architectures. As a consequence, the compiled code will be syntactically different between cross-compiler, cross-optimization, and cross-architecture binaries. Due to the syntactic difference, it is difficult to recognize similar functions between different binary code.

Previous solutions to the binary code similarity problem can be divided into two categories: static analysis and dynamic analysis. In general, these solutions suffer from the following limitations: (1) most of the existing methods heavily rely on engineered syntactic features (e.g., control-flow graph) of the binary code for similarity comparison, (2) some methods incur considerable performance cost due to adopting time-consuming mechanisms (e.g., graph matching and emulation), (3) some methods cannot compare binary code similarity well across different CPU architectures and different program versions.

To address these problems, in this paper, we present a novel binary similarity detection framework, BinDeep, by utilizing deep learning techniques. Different from previous solutions, our method does not need any manually selected feature for the similarity computation.

The basic idea of our approach is to leverage the Siamese neural network to measure the similarity of binary functions. To obtain the input for the neural network, we first disassemble the binary code and

* Corresponding author.
E-mail addresses: [email protected] (X. Jia), [email protected] (R. Ma), [email protected] (C. Hu).

https://doi.org/10.1016/j.eswa.2020.114348
Received 6 December 2019; Received in revised form 30 June 2020; Accepted 17 November 2020
Available online 3 December 2020
0957-4174/© 2020 Elsevier Ltd. All rights reserved.
D. Tian et al. Expert Systems With Applications 168 (2021) 114348

extract the instruction sequence as the features. Next, we utilize a classical natural language processing model to convert the instruction sequences into vectors. Considering the different comparison scenarios, we use different Siamese neural networks for similarity measurement. For this purpose, we apply a deep learning model to identify the specific types of the two functions to be compared. Unlike the conventional Siamese network structure, we make use of a hybrid network structure, which combines CNN and LSTM networks to measure the binary function similarity. Since CNN can extract local spatial features while LSTM is capable of extracting sequential features automatically, the hybrid neural network model could improve the similarity detection.

We have implemented our approach based on IDA Pro (Hex-Rays, 2018) and Keras (Keras Team, 2019). To evaluate the effectiveness of our method, we prepare a custom dataset, which consists of more than 4.7 million function pairs. The experiments show that our solution can identify similar and dissimilar function pairs effectively on cross-architecture, cross-compiler, cross-optimization, and cross-version binaries.

In summary, we make the following contributions:

• We propose a novel deep learning based solution for binary code similarity detection. This model utilizes the hybrid Siamese neural network to measure the binary code similarity.
• We use the instruction embedding model to vectorize the extracted instructions. To identify the types of functions to be compared, we apply a deep learning classification model.
• We conduct extensive experiments to evaluate our approach. The experimental results show that BinDeep can achieve an average precision, recall, and F1 Score of 97.07%, 98.88%, and 97.97% respectively for BCSD.

2. Approach

2.1. Problem statement

The key task of our study is to judge whether two functions $I_1$, $I_2$ from different binary code are similar or not. If the two functions are compiled from the same original source code, they are similar. Otherwise, they are dissimilar. It is worth noting that the target binaries may be compiled using different compilers with different optimization levels, and they may also come from different CPU architectures and different program versions. Due to the complications arising from different compilation, it is a non-trivial task to achieve accurate similarity comparison for two binary functions.

2.2. The Framework of BinDeep

As shown in Fig. 1, the framework of BinDeep can be divided into three stages. In the first stage, we exploit IDA Pro to analyze the binary code statically. After disassembling the instructions, we get an instruction sequence for each function in the binary code. To prepare the input for the deep learning models, we leverage an NLP model to perform the instruction embedding. By doing so, the instruction sequences are converted into vectors.

Fig. 1. The Similarity Detection Framework of BinDeep. The input to the framework is two binary functions. The output is the similarity value of these two functions.

In the second stage, we make use of a deep learning model to identify the CPU architectures and optimization levels of the target binary functions. According to the identified function types, we select a proper model for the binary code similarity detection. The main advantage of this stage is that it makes the similarity detection more targeted: different comparison scenarios will result in adopting different comparison models in the next stage.

In the third stage, we utilize the Siamese neural networks to detect the binary code similarity. There are three Siamese network models, and each one corresponds to a different comparison scenario. For simplicity, all three Siamese neural networks have the same structure, but their network parameters are not identical. Different from the traditional Siamese structure, we combine the CNN and LSTM models for the neural network construction. After all these networks are well trained, they can convert similar (or dissimilar) binary functions into similar (or dissimilar) vectors. By computing the distance between two binary functions, we can get their similarity value. If the value is smaller than the predefined threshold, we consider the two binary functions similar. Otherwise, they are dissimilar.

2.3. Feature extraction and processing

Most previous approaches rely on prior knowledge for feature extraction. In our solution, we just utilize the instruction sequence as the features. For this purpose, we exploit IDA Pro to disassemble the binary code and then get an instruction sequence for each function. For simplicity, the internal control flow of a function is not considered.

Similar to the recent studies (Massarelli, Di Luna, Petroni, Baldoni, & Querzoni, 2019; Zuo et al., 2019), we make use of an NLP (Natural Language Processing) model to build our instruction embedding. In general, an instruction can be divided into two parts: one opcode and one (or more) operand(s). The number of opcode types is limited, while the representation of the operands changes a lot across different computing scenes. To address this problem, a straightforward method is to use only instruction opcodes as tokens for instruction embedding, omitting instruction operands. However, doing so will result in information loss. In fact, instruction operands contain important semantic information for similarity comparison.

To keep the operand information, normalization is needed. Different from the recent methods (Massarelli et al., 2019; Zuo et al., 2019), we propose a simple but effective way to normalize the instruction operands. Specifically, we classify the common operands into 8 different categories: General Register, Direct Memory Reference, Memory Ref [Base Reg + Index Reg], Memory Ref [Base Reg + Index Reg + Displacement], Immediate Value, Immediate Far Address, Immediate Near Address and Other Type. Table 1 shows examples of instruction normalization for the x86 architecture.

Table 1
Instruction normalization examples.

Original instruction            Normalized instruction
push ebp                        push TypeOne
mov ebp, esp                    mov TypeOne, TypeOne
sub esp, 18h                    sub TypeOne, TypeFive
mov dword ptr [esp + 74h], 0    mov TypeThree, TypeFive
lea eax, [ebp + esi*2 + 0]      lea TypeOne, TypeFour
cmp eax, 210h                   cmp TypeOne, TypeFive
jl loc_804B271                  jl TypeSeven
mov eax, 1                      mov TypeOne, TypeFive
call _exit                      call TypeSix

After the raw instructions are normalized, the next step is to perform the instruction embedding. There are two common methods for the embedding: one-hot encoding (Wikipedia, 2018) and word2vec (Gensim, 2018). One-hot encoding is a very simple way to represent an instruction, but it cannot capture the relevance of two similar instructions. In contrast, the word2vec method is capable of converting similar instructions to similar vectors. For example, by using word2vec, the add and sub instructions will be converted into similar vectors. word2vec contains two different models: CBOW (Continuous Bag of Words) and skip-gram. Compared with the CBOW model, the skip-gram model can achieve better performance on a large dataset. Therefore, we make use of the skip-gram model to build our instruction embedding.

The basic idea of the skip-gram model is to utilize the context information (i.e., a sliding window) to learn word embeddings on a text stream. For each word, the model initially sets a one-hot encoding vector, which then gets trained when going over each sliding window. The key point of the model is to figure out the probability $P$ of an arbitrary word $w_k$¹ in a sliding window $C_t$² given the embedding $\vec{w}_t$³ of the current word $w_t$. For this purpose, the softmax function is used as follows:

$$P(w_k \mid w_t) = \frac{\exp\left(\vec{w}_t^{\,T} \vec{w}_k\right)}{\sum_{w_i \in C_t} \exp\left(\vec{w}_t^{\,T} \vec{w}_i\right)}$$

where $\vec{w}_k$ and $\vec{w}_i$ are the embeddings of words $w_k$ and $w_i$, and $\vec{w}_t^{\,T} \vec{w}_i$ is the similarity of the two words $w_t$ and $w_i$. To train the model on a sequence of $T$ words, we utilize stochastic gradient descent to minimize the negative log-likelihood objective function $J(w)$ defined as follows:

$$J(w) = -\sum_{t=1}^{T} \sum_{w_k \in C_t} \log P(w_k \mid w_t)$$

¹ A sentence $W$ consists of several words $w_k$, and it can be represented as $W = (w_1, \ldots, w_n)$, $w_k \in \mathbb{R}$, $1 \le k \le n$, where $n$ refers to the number of words in a sentence.
² A sliding window $C$ is a series of small parts of a sentence $W$, and it can be represented as $C = (c_1, \ldots, c_m)$, $c_i = (w_j, \ldots, w_{j+T-1})$, $1 \le i \le m$, $1 \le j \le n - T + 1$, $w_j \in \mathbb{R}$, where $m$ refers to the number of fixed windows, $n$ refers to the number of words in a sentence $W$, and $T$ refers to the size of a fixed window. In practice, for each word $w_j$, we use a floating-point number for representation, and each one has the same accuracy.
³ $\vec{w}_t \in \mathbb{R}^L$, where $L$ refers to the vector dimension.

Fig. 2 illustrates examples of the instruction embedding. The input of this model is an instruction, and the output is a 300-dimensional vector. In other words, the instruction sequence will be represented as multiple 300-dimensional vectors. Section 3.4 shows the effect of different embedding dimensions on the experiments.

2.4. Model classification

After the instruction sequences of functions are vectorized, the next step is to train the deep learning based comparison model so that the similarity of two functions can be well measured. Previous methods (Liu et al., 2018; Massarelli et al., 2019) use only one single deep learning model for similarity detection between cross-architecture, cross-compiler, cross-optimization, and cross-version binary code. Considering the heterogeneous comparison targets, using one deep learning model may not achieve good detection accuracy in some cases. To deal with this issue, we adopt different deep learning models for different comparison scenarios. In other words, we use a tailored model for each comparison scenario. Specifically, we classify the comparison scenarios into three categories: the same CPU architectures with different compilation, the same optimization levels with different CPU architectures, and different CPU architectures with different compilation and optimization. Accordingly, we use three different Siamese neural network models for these similarity comparisons. As mentioned previously, these models have the same network structure.

To select a proper neural network model, we need to identify the specific types of the two functions to be compared. For this purpose, we make use of an RNN-based classifier. The classifier consists of two models: one model to identify the CPU architecture (x86/x64/arm) and the other model to identify the optimization level (O0/O1/O2/O3). In each hidden layer of the RNN model, the output of the next moment is determined by the output of the current moment and the input of the next moment. Thanks to this recurrent design, the RNN model can handle sequential information such as sentences in texts well. Since the instruction sequences are similar to sentences, the RNN model is suitable for the classification task of identifying the CPU architecture and optimization level. Although the RNN model is powerful for processing sequential data, it is hard to train due to vanishing or exploding gradients during back propagation. To address this problem, we make use of the LSTM model.

Fig. 3 illustrates the architecture of our LSTM-based neural network for the compiling optimization level identification. This neural network consists of three layers. The first layer is the embedding layer. By reusing the pre-trained weights from the word2vec model, this layer can map the instruction sequence to 300-dimensional vectors. The input to this layer is the instruction embedding vectors $\vec{I} = (\vec{i}_1, \ldots, \vec{i}_n)$, $\vec{i}_j \in \mathbb{R}$, $1 \le j \le n$, $n = 1000$, and the output of this layer is another set of instruction embedding vectors $\vec{l} = (\vec{l}_1, \ldots, \vec{l}_m)$, $\vec{l}_j \in \mathbb{R}^L$, $1 \le j \le m$, $m = 1000$, $L = 300$. The second layer is an LSTM layer, and its output dimension is 50. Each LSTM unit contains the following equations:

$$
\begin{aligned}
i_t &= \mathrm{sigmoid}(W_i x_t + U_i h_{t-1} + b_i), & i_t \in \mathbb{R}^{50},\ b_i \in \mathbb{R}^{50} \\
f_t &= \mathrm{sigmoid}(W_f x_t + U_f h_{t-1} + b_f), & f_t \in \mathbb{R}^{50},\ b_f \in \mathbb{R}^{50} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), & \tilde{c}_t \in \mathbb{R}^{50},\ b_c \in \mathbb{R}^{50} \\
c_t &= i_t \odot \tilde{c}_t + f_t \odot c_{t-1}, & c_t \in \mathbb{R}^{50} \\
o_t &= \mathrm{sigmoid}(W_o x_t + U_o h_{t-1} + b_o), & o_t \in \mathbb{R}^{50},\ b_o \in \mathbb{R}^{50} \\
h_t &= o_t \odot \tanh(c_t), & h_t \in \mathbb{R}^{50}
\end{aligned}
$$

In the above equations, $c_t$, $i_t$, $f_t$, $o_t$ denote the memory state and the input, forget, and output gates at time $t \in \{1, \ldots, T\}$, the $\odot$ operator is the point-wise multiplication, $W$, $U$ are the weight matrices, and $b$ is the bias parameter. The final layer is a dense output layer with four neural units, and its output is the classification result. To facilitate the multi-class classification, we utilize softmax as the activation function. For training this neural network, we use categorical_crossentropy as our loss function, and apply adam as the optimizer. To prevent the trained neural network from overfitting, we use dropout and set its rate to 0.5.
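The LSTM gate equations of Section 2.4 can be traced by hand with a minimal sketch of a single time step. This is an illustration in plain Python rather than the Keras layer used in the paper; the helper names `affine`, `lstm_step`, and `zero_params`, as well as the toy dimensions, are assumptions introduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W.x + U.h + b, one entry per hidden unit."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bias
        for W_row, U_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps "i", "f", "c", "o" to (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(v) for v in affine(W, x_t, U, h_prev, b)]

    i_t = gate("i", sigmoid)        # input gate
    f_t = gate("f", sigmoid)        # forget gate
    c_tilde = gate("c", math.tanh)  # candidate memory state
    o_t = gate("o", sigmoid)        # output gate
    # c_t = i_t * c~_t + f_t * c_{t-1}  (point-wise)
    c_t = [i * ct + f * cp for i, ct, f, cp in zip(i_t, c_tilde, f_t, c_prev)]
    # h_t = o_t * tanh(c_t)  (point-wise)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(input_dim, hidden_dim):
    """Hypothetical all-zero parameters, handy for checking the arithmetic."""
    W = [[0.0] * input_dim for _ in range(hidden_dim)]
    U = [[0.0] * hidden_dim for _ in range(hidden_dim)]
    b = [0.0] * hidden_dim
    return {name: (W, U, b) for name in "ifco"}


h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], zero_params(3, 2))
```

With all-zero parameters every gate evaluates to 0.5 and the candidate state to 0, so the new memory state is simply half of the previous one, which makes the role of the forget gate easy to see.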

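The operand normalization of Section 2.3 can be sketched as a mapping from operand categories to type tokens. The sketch assumes that the disassembler already reports a category number (1 to 8, in the order the categories are listed in the text) for each operand; the function name `normalize` and the token `TypeEight` for the "Other Type" category are assumptions, since Table 1 only shows tokens up to TypeSeven.

```python
# Token per operand category, in the order the categories are listed in Section 2.3.
TYPE_TOKENS = {
    1: "TypeOne",    # General Register
    2: "TypeTwo",    # Direct Memory Reference
    3: "TypeThree",  # Memory Ref [Base Reg + Index Reg]
    4: "TypeFour",   # Memory Ref [Base Reg + Index Reg + Displacement]
    5: "TypeFive",   # Immediate Value
    6: "TypeSix",    # Immediate Far Address
    7: "TypeSeven",  # Immediate Near Address
}
OTHER_TOKEN = "TypeEight"  # assumed name for the "Other Type" category


def normalize(mnemonic, operand_categories):
    """Return the normalized instruction token for one instruction.

    operand_categories is a list of category numbers, one per operand,
    assumed to be supplied by the disassembler.
    """
    tokens = [TYPE_TOKENS.get(c, OTHER_TOKEN) for c in operand_categories]
    if not tokens:
        return mnemonic
    return f"{mnemonic} {', '.join(tokens)}"


# "sub esp, 18h": a register followed by an immediate value.
print(normalize("sub", [1, 5]))
# "call _exit": a far address.
print(normalize("call", [6]))
```

Each normalized string is then treated as one word of the vocabulary on which the instruction embedding is trained.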

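The skip-gram probability $P(w_k \mid w_t)$ and the objective $J(w)$ of Section 2.3 can be checked on toy vectors. The sketch below is not the word2vec implementation used in the paper; it evaluates the softmax over one sliding window in plain Python, and the names `context_probability` and `window_loss` and the two-dimensional embeddings are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_probability(w_t, w_k, window):
    """P(w_k | w_t): softmax of dot products over the embeddings in the window.

    w_k is assumed to be one of the embeddings in `window`.
    """
    scores = [dot(w_t, w_i) for w_i in window]
    shift = max(scores)  # subtract the maximum for numerical stability
    denominator = sum(math.exp(s - shift) for s in scores)
    return math.exp(dot(w_t, w_k) - shift) / denominator


def window_loss(w_t, window):
    """Contribution of one window to J(w): minus the sum of log-probabilities."""
    return -sum(math.log(context_probability(w_t, w_k, window)) for w_k in window)


# Toy 2-dimensional embeddings (hypothetical values).
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_t = [0.5, -0.5]
probs = [context_probability(w_t, w_k, window) for w_k in window]
```

The probabilities over a window sum to one, and the context word whose embedding has the largest dot product with the current word receives the highest probability, which is the quantity that training pushes up for words that actually co-occur.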
Fig. 2. The process of converting assembly instructions into embedding vectors.

Fig. 3. Neural network architecture for model classification. The input to the network is a binary function. The output is the identification type of this function.

2.5. Similarity comparison

After the types of the two binary functions are identified, we select the corresponding neural network model for computing the similarity. Motivated by the recent method for assessing semantic similarity between sentences (Mueller & Thyagarajan, 2016), we leverage the Siamese network for measuring the similarity of two binary functions. The basic idea of the Siamese network is to use two identical mapping functions for converting the two inputs into two fixed-dimensional vectors. By comparing the distance between these two vectors, we can figure out the similarity of the two original inputs. As shown in Fig. 4, the Siamese network architecture consists of two identical embedding neural networks. Different from the standard methods, we combine the CNN and LSTM models to construct the neural networks. The CNN model is good at spatial structure learning, while the LSTM model excels at sequence learning. Our hybrid model takes advantage of both models so that the deep learning-based similarity detection could be improved. For ease of representation, we use the terms hybrid model and CLSTM (i.e., CNN + LSTM) model interchangeably.

Fig. 4. Siamese network architecture for similarity comparison. The input to the network is two binary functions. The output is the similarity comparison result.

Fig. 5 illustrates the structure of the hybrid neural network. It consists of 5 layers. The first embedding layer is used for mapping the instruction sequence to a fixed-dimensional vector. This layer can be represented by a mapping function $f : X \to \vec{X}$, where $X$ is the instruction vector, and $\vec{X}$ is the mapping vector. $X = (x_1, \ldots, x_{1000})$, $x_j \in \mathbb{R}$, $1 \le j \le 1000$; $\vec{X} = (\vec{x}_1, \ldots, \vec{x}_{1000})$, $\vec{x}_j \in \mathbb{R}^L$, $1 \le j \le 1000$, $L = 300$. This layer reuses the pre-trained weights from the word2vec model. The second layer is an LSTM layer, and its output dimension is 20. The output of this layer can be expressed as follows:

$$H = \mathrm{LSTM}\left(\vec{X}\right)_{W_1, b_1}$$

where LSTM represents the mathematical operations performed at the LSTM units, and $W_1$ and $b_1$ are their weight and bias parameters. $H = (h_1, \ldots, h_{1000})$, $h_j \in \mathbb{R}^L$, $1 \le j \le 1000$, $L = 20$. The third layer is a CNN layer, which contains 300 one-dimensional convolution filters⁴. The length of the 1D convolution window is set to 5, and the convolution stride is set to one. The CNN layer is aimed at extracting local features from the embedding layer. The output of this layer can be represented as follows:

$$A = \sigma\left(\mathrm{Conv}(H)_{W_2, b_2}\right)$$

where $\sigma$ is the ReLU function, Conv represents the convolution operation, and $W_2$ and $b_2$ are the weight and bias parameters of the convolutional filters. $A = (a_1, \ldots, a_m)$, $a_i \in \mathbb{R}^L$, $1 \le i \le m$, $L = 996$, $m = 300$. The fourth layer is a max pooling layer. It is used to simplify the extracted features. The size of the max pooling window is set to 2, and the stride is also set to 2. For simplicity, the max pooling operation can be represented as the expression $\tilde{A} = \mathrm{Max}(A)$, with $\tilde{A} = (\tilde{a}_1, \ldots, \tilde{a}_m)$, $\tilde{a}_i \in \mathbb{R}^L$, $1 \le i \le m$, $L = 498$, $m = 300$. The final layer is a dense layer, which helps connect the local features. Its output can be represented as follows:

$$O = \sigma\left(W_3 \cdot \tilde{A} + b_3\right)$$

where $\cdot$ represents the dot product operation, and $W_3$ and $b_3$ are the weight and bias parameters of this layer. $O = (o_1, \ldots, o_m)$, $o_i \in \mathbb{R}$, $1 \le i \le m$, $m = 300$.

⁴ 1D-Convolution processing is usually used in NLP.

Fig. 5. Hybrid neural network architecture.

The inputs to the Siamese network are two binary functions, namely $I_1$ and $I_2$. These two functions may be compiled for different CPU architectures, with different compilers and optimization levels, and from different program versions. The input length is set to 1000. If the function contains fewer than 1000 instructions, we use nop instructions as padding. On the other hand, if the function contains more than 1000 instructions, we truncate the tail instructions. The outputs of the embedding layers of the Siamese network are the two embedding vectors, namely $f(I_1, \theta)$ and $f(I_2, \theta)$, where $f$⁵ represents the hybrid network structure, and $\theta$ represents the parameters of this network structure. We assume the embedding dimension is $m$. Additionally, there is an indicator input $y$ to the Siamese network, indicating whether the two inputs are similar or not. Precisely, if $y$ is equal to 1, the two binary functions are similar; if $y$ is 0, the two binary functions are dissimilar.

⁵ $f : (I, \theta) \to \vec{I}$ is a parameterized function that takes a binary function $I = (i_1, \ldots, i_{1000})$, $i_j \in \mathbb{R}$, $1 \le j \le 1000$, as input, and outputs an embedding vector $\vec{I} = (\vec{i}_1, \ldots, \vec{i}_{300})$, $\vec{i}_j \in \mathbb{R}$, $1 \le j \le 300$.

To define the loss function of the network, we leverage the contrastive loss function (Hadsell, Chopra, & LeCun, 2006). The basic idea of this loss function is to maximize the distance between two dissimilar inputs, but to minimize the distance between two similar inputs. For this purpose, the loss function is defined as follows:

$$L(\theta) = \mathrm{Average}\left\{ y \cdot D(I_1, I_2) + (1 - y) \cdot \max\left(0,\ 1 - D(I_1, I_2)\right) \right\}$$

$$D(I_1, I_2) = \sum_{k=1}^{m} \left| f(i_{1k}, \theta_k) - f(i_{2k}, \theta_k) \right|$$

where $D(I_1, I_2)$ denotes the Manhattan distance between two binary functions. Training the Siamese network means finding the parameters $\theta$ that minimize the loss function. To this end, we use the Adam optimizer with the standard back propagation algorithm.

After the Siamese network is well trained, we can infer two statements. When two binary functions are similar (i.e., $y = 1$), their Manhattan distance should be close to zero so that the loss value will be minimal. When two binary functions are dissimilar (i.e., $y = 0$), their Manhattan distance should be close to one⁶, and the loss function value will still be minimal.

⁶ The minimal margin distance between two dissimilar functions is set to one according to our experiments.

3. Evaluation

Our experiments are carried out on a Dell T360 Server equipped with two Intel Xeon E5-2603 V4 CPUs, 16 GB memory, 2 TB hard drives, and one NVIDIA Tesla P100 12 GB GPU card. We implement a plug-in of the tool IDA Pro 7.0 to extract an instruction sequence from each binary function. The network models are implemented in TensorFlow-1.8 (Abadi et al., 2016) and Keras-2.2 (Keras Team, 2019). All these networks are well trained within 50 epochs.

3.1. Dataset

To prepare the dataset, we select 6 popular Linux packages, including coreutils, findutils, diffutils, sg3utils, and util-linux. After getting the package source code, we use three CPU architectures (x86, x86-64 and ARM) and two compilers (gcc and clang) with four optimization levels (O0, O1, O2, and O3) to compile each program. For the x86 and x86-64 architectures, the compilers are allowed to use the extended instruction sets (e.g., MMX and SSE). If two binary functions are compiled from the same source code, they are matched. Otherwise, they are not matched.

To facilitate the supervised learning, the proportions of positive and negative samples (i.e., pairs of matched and unmatched functions) are relatively balanced. To get the positive training samples, we use the symbol information in the unstripped binary code to identify the matched functions in the different compiled files. To get the negative training samples, we randomly select functions with different function names from different binary files. The labels for the positive samples and negative samples are ones and zeros, respectively.

In total, we obtain 4729140 samples. As shown in Table 2, these samples can be divided into 5 categories: cross-compiler, cross-optimization, cross-version, cross-architecture, and mixed function pairs. To evaluate the effectiveness of our method on unseen binary code, the whole dataset is split into three disjoint subsets for training, validation and testing. We set the proportion of these three subsets to 4:1:1. The debug symbol information is all stripped in these samples.

Table 2
Description of sample types.

Sample type                          Number    Remarks
Cross-compiler function pairs        855136    Only the compilers are different
Cross-optimization function pairs    855136    Only the optimization levels are different
Cross-version function pairs         501028    Only the function versions are different
Cross-architecture function pairs    1282704   Only the CPU architectures are different
Mixed function pairs                 1235136   The compilers, optimization levels, versions, and architectures may be all different

3.2. Evaluation metrics

In order to evaluate the performance of our method, we make use of the standard metrics: accuracy, precision, recall, F1, and FPR, which are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{5}$$

In the above formulas, the True Positive (TP) represents the number of correctly identified matched function pairs. The False Positive (FP) refers to the number of wrongly identified function pairs when the deep learning model identifies the unmatched function pairs as matched. The


True Negative (TN) represents the number of correctly identified unmatched function pairs. The False Negative (FN) refers to the number of matched function pairs that are wrongly identified as unmatched. Accuracy refers to the percentage of function pairs that are identified correctly. Precision measures the percentage of function pairs labeled as matched that are truly matched. Recall represents the ability to identify matched function pairs correctly. FPR measures the percentage of unmatched function pairs that are incorrectly labeled as matched ones. F1 score refers to the harmonic mean of Precision and Recall.

3.3. Effect of the feature processing

In general, we use the instruction sequence as the features. An instruction consists of one opcode and one (or more) operand(s). Some methods (HaddadPajouh, Dehghantanha, Khayami, & Choo, 2018) use the opcode as the feature, omitting the operands, while our method uses the whole instruction as the feature. To compare the effect of the different feature processing on the similarity identification, we conduct the corresponding experiments. As shown in Table 3, our method is better than the approaches that only use the opcode as features in various evaluation metrics, including accuracy, precision, recall, F1 score, and FPR. The main reason is that the whole instruction contains more information than the opcode.

3.4. Effect of the embedding dimension

In this part, we explore the effect of the embedding dimension on the identification results. For this purpose, we analyze the performance of the LSTM and CLSTM neural network models for the mixed function pairs with different embedding dimensions. Table 4 shows the identification metrics of the LSTM model. When the embedding dimension is increased, the identification result becomes better. For the CLSTM model shown in Table 5, when the embedding dimension is changed from 100 to 300, the identification result is relatively stable. These experiments show that the CLSTM model is more robust than the LSTM model with respect to the embedding dimension setting.

Table 4
Results under different embedding dimension in the LSTM model.

Embedding dimension  Accuracy  Precision  Recall  F1 Score  FPR
100                  0.9062    0.8964     0.8526  0.8739    0.0609
200                  0.9107    0.9021     0.8599  0.8804    0.0578
300                  0.9248    0.9208     0.8776  0.8986    0.0466

Table 5
Results under different embedding dimension in the CLSTM model.

Embedding dimension  Accuracy  Precision  Recall  F1 Score  FPR
100                  0.9820    0.9692     0.9844  0.9767    0.0195
200                  0.9812    0.9690     0.9825  0.9757    0.0196
300                  0.9843    0.9707     0.9888  0.9797    0.0185

3.5. Effect of the number of hidden units

To examine whether the number of hidden units affects the identification results, we carry out the experiments with the number of hidden units in the neural network varying from 2 to 20. Fig. 6a and 6b show the results on accuracy, recall, F1 score, and FPR when using the LSTM and CLSTM neural network models. In general, as the number of hidden units increases, the accuracy and recall of these models increase, and the FPR decreases. In the LSTM model, the identification metrics are similar when setting the number of hidden units to 16, 18, and 20. Considering that more hidden units will result in more computing cost, we think setting the number of units to 16 is a good compromise between effectiveness and efficiency. In the CLSTM model, when the number of hidden units is 18, the various evaluation metrics are optimal.

3.6. Effect of adding the classification model

Previous methods only use a single neural network model to measure

conduct a set of tests. Table 6 and Table 7 show the comparison results of the LSTM and CLSTM neural network models when the classification model is enabled/disabled. With the help of the classification model, the identification accuracy, precision, recall, and F1 score are all improved, and the FPR is decreased in the LSTM and CLSTM models. In particular, the accuracy, precision, recall, and F1 score of the CLSTM model are increased by 6.38%, 6.79%, 10.66% and 8.73%, and the FPR of the CLSTM model is decreased by 4.04%. The main reason for the effectiveness of adding the classification model is that we can utilize a more targeted neural network model for the similarity detection.

3.7. Effect of the neural network structure

To explore the effect of the network structure on the similarity measurement, we carry out a set of experiments in different scenarios. As mentioned previously, the dataset can be divided into 5 different categories, which correspond to different comparison scenarios. Regarding the performance on the mixed dataset, Table 8 shows the evaluation results of the CNN, LSTM and CLSTM neural network models. In these models, the embedding dimension is set to 300, and the number of hidden units is set to 18. From this table, we can see that the CLSTM model has clearly better performance than the CNN and LSTM models. Table 9, Table 11, and Table 10 show the performance results on the cross-architecture, cross-compiler and cross-optimization datasets respectively. Similarly, the CLSTM model has better performance in these experiments. For the performance on the cross-version dataset, Table 12 illustrates the evaluation result. In this evaluation, the dataset consists of 6 versions of the GNU Core Utilities, including the latest version 8.31. This evaluation also demonstrates that the CLSTM model is superior to the CNN and LSTM models.

4. Discussions

Similar to the previous studies (Liu et al., 2018; Massarelli et al., 2019; Shalev & Partush, 2018; Xu et al., 2017; Zuo et al., 2019), our method is limited in coping with obfuscated binary code. Before applying our approach, a deobfuscation procedure is needed to first extract the internal logic from the obfuscated code. To this end, we could
the similarities of two functions. Different from these methods, we first leverage the recent deobfuscation techniques (Yadegari, Johannes­
utilize the LSTM based classification model to identify the function types meyer, Whitely, & Debray, 2015; Xu, Ming, Fu, & Wu, 2018). We plan to
and then select the proper neural network model for the similarity explore the combination of our current method and the deobfuscation
measurement. To show the effeteness of the classification model, we technique as our future work.
To further improve the detection accuracy of our method, a potential
Table 3 solution is to use different neural network structures for different com­
Results under different instruction features. parison scenarios. For example, we could apply the BiLSTM model for
BCSD across different architectures, and use the CLSTM model for BCSD
Instruction feature Accuracy Precision Recall F1 Score FPR
across different compilers. In addition, we may consider more compar­
Opcode 0.9763 0.9654 0.9725 0.9689 0.0215 ison scenarios, which will correspond to more network models. To
Opcode + Operand 0.9851 0.9705 0.9913 0.9808 0.0187
measure the similarity distance, we could explore a different method.

D. Tian et al. Expert Systems With Applications 168 (2021) 114348

Fig. 6. Results under different numbers of hidden units in the LSTM and CLSTM models.
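The metrics plotted in Fig. 6 and listed in the result tables (accuracy, precision, recall, F1 score, and FPR) can all be derived from confusion-matrix counts. The following is a minimal sketch that assumes the standard definitions of these metrics; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn

    def ratio(num: float, den: float) -> float:
        # Guard against empty denominators (e.g. no positive predictions).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, total),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": ratio(2 * precision * recall, precision + recall),
        # FPR: share of dissimilar pairs wrongly reported as similar.
        "fpr": ratio(fp, fp + tn),
    }


# Hypothetical counts for 200 function pairs (not taken from the paper).
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the F1 score is the harmonic mean of precision and recall, it always lies between the two, which gives a quick consistency check for any reported row.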

Table 6
Results of the LSTM and classification model + LSTM.

Model                         Accuracy   Precision   Recall   F1 Score   FPR
LSTM                          0.8830     0.8788      0.8493   0.8637     0.0908
Classification model + LSTM   0.9263     0.9128      0.8746   0.8932     0.0515

Table 7
Results of the CLSTM and classification model + CLSTM.

Model                          Accuracy   Precision   Recall   F1 Score   FPR
CLSTM                          0.9206     0.9023      0.8829   0.8924     0.0591
Classification model + CLSTM   0.9844     0.9702      0.9895   0.9797     0.0187

Table 8
Comparison results of LSTM, CNN and CLSTM models on the mixed dataset.

Model   Accuracy   Precision   Recall   F1 Score   FPR
LSTM    0.9228     0.9110      0.8808   0.8956     0.0534
CNN     0.9029     0.8824      0.8699   0.8761     0.0900
CLSTM   0.9868     0.9700      0.9911   0.9804     0.0189

Table 9
Comparison results of LSTM, CNN and CLSTM models on the cross-architecture dataset.

Model   Accuracy   Precision   Recall   F1 Score   FPR
LSTM    0.9204     0.9025      0.8871   0.8947     0.0619
CNN     0.9093     0.8939      0.8649   0.8791     0.0897
CLSTM   0.9800     0.9690      0.9859   0.9804     0.0245

Table 10
Comparison results of LSTM, CNN and CLSTM models on the cross-optimization dataset.

Model   Accuracy   Precision   Recall   F1 Score   FPR
LSTM    0.9304     0.9179      0.8812   0.8991     0.0459
CNN     0.8998     0.9149      0.8679   0.8907     0.0718
CLSTM   0.9913     0.9860      0.9955   0.9769     0.0123

Table 11
Comparison results of LSTM, CNN and CLSTM models on the cross-compiler dataset.

Model   Accuracy   Precision   Recall   F1 Score   FPR
LSTM    0.9069     0.8959      0.8798   0.8878     0.0733
CNN     0.9023     0.9031      0.8699   0.8861     0.0721
CLSTM   0.9727     0.9312      0.9956   0.9623     0.0296

As one alternative similarity measure, we may utilize the Hamming distance of static binary features (Taheri et al., 2020) for BCSD. Considering the recent data poisoning attacks on machine learning models, we may also leverage the existing work (Taheri, Javidan, Shojafar, Vinod, & Conti, 2020) to implement a defense.

5. Related work

Static Analysis. The basic idea of static analysis methods is to analyze the program code statically without executing it. David and Yahav (2014) propose a tracelet matching method for computing similarity between functions. Eschweiler, Yakdan, and Gerhards-Padilla (2016) utilize numeric features to identify potential candidate functions, and then exploit structural features for similarity computation. Chandramohan et al. (2016) present a selective inlining technique to capture function semantics and use this technique for binary search across different CPU architectures and operating systems. Shalev and Partush (2018) employ a machine learning method for BCSD. For cross-architecture vulnerability search in binary firmware, Zhao et al. (2019) propose a novel solution based on kNN-SVM and the attributed control flow graph. Wang, Shen, Lin, and Lou (2019) develop a staged firmware function similarity analysis approach, which considers the invocation relations as important features. Feng et al. (2016) present a graph-based method for bug search across different architectures. By converting the CFGs into numeric feature vectors, this approach can achieve real-time search. To improve the detection performance, Xu et al. (2017) propose a novel neural network-based approach for BCSD. Recently, Liu et al. (2018) make use of a deep neural network (DNN) to cope with the challenges of BCSD. For the final similarity measurement, this approach still relies on manually selected inter-function features. Massarelli et al. (2019) propose a solution to generate function embeddings based on a self-attentive neural network. These embeddings can easily be used for computing binary similarity. Zuo et al. (2019) make use of NLP techniques to resolve the code equivalence problem and the code containment problem. In industry, BinDiff (Zynamics, 2018) is a popular tool to identify similar functions in different binaries. The main advantages of static analysis methods are efficiency, simplicity, and scalability. Our method belongs to this category. Different from the previous methods, our approach does not require prior knowledge to extract syntactic features for BCSD.
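Several of the static approaches surveyed in this section reduce each function to a numeric vector and then compare vectors. As a rough illustration of that comparison step only, the sketch below uses cosine similarity; this measure is an assumption made for the example (the cited works may use other distances), and the function name and embedding values are hypothetical.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here: a zero vector matches nothing.
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional function embeddings (real ones are much larger).
func_a = [0.2, 0.7, 0.1]
func_b = [0.4, 1.4, 0.2]   # Same direction as func_a, different magnitude.
func_c = [0.9, -0.1, 0.3]  # Points in a different direction.

print(cosine_similarity(func_a, func_b))
print(cosine_similarity(func_a, func_c))
```

In this sketch a pair of functions would be reported as similar when the score exceeds a chosen threshold; the threshold itself would have to be tuned on labeled function pairs.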
Dynamic Analysis. Compared with the static analysis methods, dynamic analysis methods need to run the program code in the execution environment. Pewny, Garmany, Gawlik, Rossow, and Holz (2015) propose a dynamic approach to identify vulnerabilities across architectures. This method uses the I/O behavior of basic blocks for similarity measurement. Egele, Woo, Chapman, and Brumley (2014) present a novel dynamic equivalence testing primitive to obtain the semantics of a function. This approach can effectively identify similar binaries with significant syntactic differences. Jhi et al. (2015) develop an emulation-based method for software plagiarism detection. Wang and Wu (2017) propose an in-memory fuzzing method for binary code similarity analysis. Similarly, Hu et al. (2018) combine the emulation technique and the Longest Common Subsequence (LCS) algorithm to detect binary clone functions. In general, dynamic analysis methods introduce considerable performance cost. Consequently, they may not be widely applicable in some cases. In addition, most of the existing dynamic analysis methods cannot handle binaries across different architectures well, owing to the complexity involved.

Table 12
Comparison results of LSTM, CNN and CLSTM on the cross-version dataset.

Model   Version   Accuracy   Precision   Recall   F1 Score   FPR
LSTM    5.0       0.6424     0.6576      0.6411   0.6599     0.2904
        6.0       0.7241     0.7267      0.7190   0.7254     0.2888
        7.0       0.7409     0.7399      0.7536   0.7404     0.2565
        8.0       0.8133     0.8402      0.8209   0.8265     0.1409
        8.30      0.9888     0.9735      0.9743   0.9738     0.0377
CNN     5.0       0.6289     0.6304      0.6253   0.6296     0.3104
        6.0       0.6920     0.7156      0.6889   0.7036     0.2818
        7.0       0.7165     0.7085      0.7212   0.7125     0.2765
        8.0       0.7787     0.7792      0.7867   0.7789     0.1963
        8.30      0.9798     0.9774      0.9659   0.9716     0.0584
CLSTM   5.0       0.7135     0.7103      0.7036   0.7069     0.1221
        6.0       0.7868     0.7834      0.7765   0.7799     0.1325
        7.0       0.8154     0.8053      0.8175   0.8113     0.0899
        8.0       0.8732     0.8942      0.8766   0.8854     0.0321
        8.30      0.9960     0.9925      0.9995   0.9908     0.0075

6. Conclusion

In this paper, we present BinDeep, a novel deep-learning-based solution for binary code similarity detection. We exploit IDA Pro to extract instruction sequences as features for target functions. To vectorize these features, we leverage a classical NLP model. Next, we apply an RNN-based model to identify the types of target functions. Based on this function type information, we select the corresponding Siamese neural network model for the similarity measurement. Compared to the previous work, we combine the CNN and LSTM models to construct the Siamese neural network. The evaluation results show that BinDeep can achieve an average detection accuracy of 98.43% for BCSD across different architectures, compilers, and optimization levels.

CRediT authorship contribution statement

Donghai Tian proposed the method, conducted the main experiments, and drafted the manuscript. Xiaoqi Jia improved the method. Rui Ma participated in the discussion of the algorithm and experiments, gave advice, and directed the experimental procedure. Shuke Liu carried out part of the experiments and wrote part of the manuscript. Wenjing Liu wrote part of the manuscript. Changzhen Hu gave some very insightful advice about the writing of this manuscript. All authors read and approved the final manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We would like to thank Dr. Wenmao Liu from NSFOCUS for his constructive comments. This work was supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02010900), the National Key Research and Development Program of China (No. 2016QY04W0903), the Beijing Municipal Science and Technology Commission (No. Z191100007119010), the National Natural Science Foundation of China (No. 61772078 and No. 61602035), the CCF-NSFOCUS Kun-Peng Scientific Research Foundation, and the Open Fund of the Shanxi Military and Civilian Integration Software Engineering Technology Research Center.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX conference on operating systems design and implementation (pp. 265–283). Berkeley, CA, USA: USENIX Association.
Chandramohan, M., Xue, Y., Xu, Z., Liu, Y., Cho, C. Y., & Tan, H. B. K. (2016). Bingo: Cross-architecture cross-os binary search. In Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering (pp. 678–689). New York, NY, USA: ACM.
David, Y., & Yahav, E. (2014). Tracelet-based code search in executables. In Proceedings of the 35th ACM SIGPLAN conference on programming language design and implementation (pp. 349–360). ACM.
Egele, M., Woo, M., Chapman, P., & Brumley, D. (2014). Blanket execution: Dynamic similarity testing for program binaries and components. In Proceedings of the 23rd USENIX conference on security symposium (pp. 303–317). Berkeley, CA, USA: USENIX Association.
Eschweiler, S., Yakdan, K., & Gerhards-Padilla, E. (2016). discovre: Efficient cross-architecture identification of bugs in binary code. In Proceedings of the 2016 network and distributed systems security symposium (NDSS).
Feng, Q., Zhou, R., Xu, C., Cheng, Y., Testa, B., & Yin, H. (2016). Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (pp. 480–491). New York, NY, USA: ACM.
Gensim (2018). Word2vec embeddings. http://radimrehurek.com/gensim/models/word2vec.html.
HaddadPajouh, H., Dehghantanha, A., Khayami, R., & Choo, K. K. R. (2018). A deep recurrent neural network based approach for internet of things malware threat hunting. Future Generation Computer Systems, 85, 88–96.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06) (pp. 1735–1742).
Hex-Rays (2018). Ida pro disassembler and debugger. https://www.hex-rays.com/products/ida/index.shtml.
Hu, Y., Zhang, Y., Li, J., Wang, H., Li, B., & Gu, D. (2018). Binmatch: A semantics-based hybrid approach on binary code clone analysis. In 2018 IEEE international conference on software maintenance and evolution (ICSME) (pp. 104–114).
Jhi, Y., Jia, X., Wang, X., Zhu, S., Liu, P., & Wu, D. (2015). Program characterization using runtime values and its application to software plagiarism detection. IEEE Transactions on Software Engineering, 41, 925–943.
Keras Team (2019). Keras: The python deep learning library. https://keras.io/.
Liu, B., Huo, W., Zhang, C., Li, W., Li, F., Piao, A., & Zou, W. (2018). alpha diff: Cross-version binary code similarity detection with dnn. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering (pp. 667–678). ACM.
Massarelli, L., Di Luna, G. A., Petroni, F., Baldoni, R., & Querzoni, L. (2019). Safe: Self-attentive function embeddings for binary similarity. In R. Perdisci, C. Maurice, G. Giacinto, & M. Almgren (Eds.), Detection of intrusions and malware, and vulnerability assessment (pp. 309–329). Cham: Springer International Publishing.
Mueller, J., & Thyagarajan, A. (2016). Siamese recurrent architectures for learning sentence similarity. In Proceedings of the thirtieth AAAI conference on artificial intelligence (pp. 2786–2792). AAAI Press.
Pewny, J., Garmany, B., Gawlik, R., Rossow, C., & Holz, T. (2015). Cross-architecture bug search in binary executables. In 2015 IEEE symposium on security and privacy (pp. 709–724).
Shalev, N., & Partush, N. (2018). Binary similarity detection using machine learning. In Proceedings of the 13th workshop on programming languages and analysis for security (pp. 42–47). New York, NY, USA: ACM.
Taheri, R., Ghahramani, M., Javidan, R., Shojafar, M., Pooranian, Z., & Conti, M. (2020). Similarity-based android malware detection using hamming distance of static binary features. Future Generation Computer Systems, 105, 230–247.
Taheri, R., Javidan, R., Shojafar, M., Vinod, P., & Conti, M. (2020). Can machine learning model with static features be fooled: An adversarial machine learning approach. Cluster Computing.
Wang, Y., Shen, J., Lin, J., & Lou, R. (2019). Staged method of code similarity analysis for firmware vulnerability detection. IEEE Access, 7, 14171–14185.


Wang, S., & Wu, D. (2017). In-memory fuzzing for binary code similarity analysis. In Proceedings of the 32nd IEEE/ACM international conference on automated software engineering (pp. 319–330).
Wikipedia (2018). One-hot. https://en.wikipedia.org/wiki/One-hot.
Xu, X., Liu, C., Feng, Q., Yin, H., Song, L., & Song, D. (2017). Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security (pp. 363–376).
Xu, D., Ming, J., Fu, Y., & Wu, D. (2018). Vmhunt: A verifiable approach to partially-virtualized binary code simplification. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security (pp. 442–458). ACM.
Yadegari, B., Johannesmeyer, B., Whitely, B., & Debray, S. (2015). A generic approach to automatic deobfuscation of executable code. In 2015 IEEE symposium on security and privacy (pp. 674–691).
Zhao, D., Lin, H., Ran, L., Han, M., Tian, J., Lu, L., et al. (2019). Cvsksa: Cross-architecture vulnerability search in firmware based on knn-svm and attributed control flow graph. Software Quality Journal.
Zuo, F., Li, X., Young, P., Luo, L., Zeng, Q., & Zhang, Z. (2019). Neural machine translation inspired binary code similarity comparison beyond function pairs. In Proceedings of the 2019 network and distributed systems security symposium (NDSS).
Zynamics (2018). Bindiff. http://www.zynamics.com/bindiff.html.
