
Machine Learning Methods For Data Security

This document proposes a two-stage methodology for behavioral-based malware analysis and detection using random projections and K-nearest neighbors classifiers. In the first stage, random projection is used to reduce the dimensionality of dynamic feature data from sandbox execution, decreasing computational time. In the second stage, a modified K-nearest neighbors classifier with VirusTotal labeling is applied to a large set of file samples to classify malware. Files classified as false negatives are then analyzed to potentially detect previously unidentified malware missed by VirusTotal. This allows manual inspection of a reduced set of false negative files by an expert.

Uploaded by

Jozsef Hegedus

Machine learning methods for data security

Author: József Hegedűs

Supervisor:
Prof. Pekka Orponen

Instructor:
Doc. Amaury Lendasse

2012

Methodology for Behavioral-based Malware Analysis and Detection


using Random Projections and K-Nearest Neighbors Classifiers
Jozsef Hegedus, Yoan Miche, Alexander Ilin and Amaury Lendasse
Department of Information and Computer Science,
Aalto University School of Science,
FI-00076 Aalto, Finland
Abstract: In this paper, a two-stage methodology to analyze and detect behavioral-based malware is presented. In the first stage, a random projection reduces the variable dimensionality of the problem and simultaneously reduces the computational time of the classification task by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with VirusTotal labeling of the file samples. This methodology is applied to a large number of file samples provided by F-Secure Corporation, for which a dynamic feature has been extracted during DeepGuard sandbox execution. As a result, the files classified as false negatives are used to detect possible malware that was not detected in the first place by VirusTotal. The reduced number of selected false negatives allows manual inspection by a human expert.

I. Introduction
Malware detection has been the subject of a large
number of studies (see [1], [2], [3] and [4], [5], [6], [7],
[8]); for example, the work of Bailey [9] using a signature-based
malware detection approach has shown that recent
malware types require additional information in order to
obtain good detection.
In this paper, an approach based on the extraction of
dynamic features during sandbox execution is used, as
suggested in [7]. In order to measure similarities between
executable files, the Jaccard Index is used to measure the
similarities between hash values (encoding the dynamic
feature values obtained from the sandbox). The hash
values are transformed into a large number of binary
values which could be used to compute the Jaccard Index
(see [10] for original work in French or [11] in English).
Unfortunately, the dimensionality of such a variable space
does not allow the use of traditional classifiers within a
reasonable computational time.
A two-stage methodology is proposed to circumvent
this dimensionality problem. In the first stage, a random
projection reduces the variable dimensionality of
the problem and simultaneously reduces the computational time by several orders of magnitude. In the
second stage, a modified K-Nearest Neighbors classifier
is used with VirusTotal [12] labeling of the file samples.
This two-stage methodology is presented in section III.

The practical implementation of the methodology and


the results are discussed in section IV. The different
parameters (the random projection dimension and the
number of nearest neighbors) are also analysed in this
section.
As a global result, the methodology enables the identification of
the false negatives from the classification. Such samples
can then be used to detect possible malware that were
not detected in the first place by the VirusTotal labeling.
Thanks to the methodology, the reduced number of
identified false negatives allows for a manual inspection
by a human expert.
Indeed, without this pruning of possibly malicious
samples by the presented methodology, a manual inspection would not be possible, since reliable experts are scarce
and their availability is highly limited.
Using the proposed methodology and the know-how
of one F-Secure Corporation expert, it has been possible
to extract 24 malware candidates out of 2441 original
candidates, of which 25% are surely malicious and
50%, which are probably malicious, have to be further
investigated in order to obtain a decisive classification.
In section II, the data gathering and sample labeling
are described. Section III presents the two-stage methodology while section IV shows the practical implementation, the results and the analysis of the results.
II. Behavioral Data Gathering and Sample Labeling

The data set used in this paper is focused on behavior-based malware analysis and detection. The former approach of signature-based malware detection can no longer be
considered sufficient for reliable detection
[9], [7]. Be it because of the development of polymorphic
and metamorphic malware, or because of "flash
worms" which only perform some reconnaissance on the
machines/networks they scan for future deployment of
targeted attacks, the need for execution-level identification is important.
A. Sandboxing and Extracting Behavioral Features

In this spirit, a currently popular approach [7], [6] is
to sandbox the execution of the malware and analyze
behavioral data extracted during the execution.

It has recently been demonstrated in [8] that the use of
public sandbox submission systems might reveal network
information regarding the sandbox machine identity.
Through the submission of a decoy sample by an attacker,
it becomes possible to blacklist the hosts on which the
samples are sandboxed and have the malware circumvent
the sandbox execution and thus detection.

The Norman sandbox development kit [13], released in
2009, enables security companies to gather the behavioral
data obtained during sandboxed execution and analyze
that data with a custom engine. This avoids the pitfall
of the publicly available sandbox machine mentioned above.

The results in this paper were obtained on the data of
32683 samples collected by F-Secure Corporation. The
sample data were produced by F-Secure by running the
samples through their sandbox engine [14], [15], [16],
which resulted in large numbers of feature-value pairs
extracted for each sample. Individual features may have
a significant number of distinct values, and the values
come in the form of hashes. The data cannot be considered complete, as the sandbox, for instance, may not
be able to run some of the samples correctly or may miss
relevant execution paths.

The samples were labeled using an online sample
analysis tool explained in the next section.

B. Obtaining the Sample Labeling

The VirusTotal [12] online analysis tool provides a
simple interface for sample submission, returning a list
of up to 43 (depending on the sample nature: executable,
archive...) mainstream anti-virus software detection results. Among the most widely used and known are F-Prot, F-Secure, ClamAV, Antivir, AVG, BitDefender,
eSafe, Avast, McAfee, NOD32, Norman, Panda, Symantec, TrendMicro, VirusBuster... See the VirusTotal web
site for the full list of engines used [12].

The result of the submission of a sample file is the
number of engines which detected the sample as malware. Figure 1 is a histogram of the detection levels
for the set of 32683 samples used in this paper. As can
be seen, a large proportion of the set is detected by at
least one engine as malware. Fewer than 2500 samples are
not detected by any engine.

In order to make the problem a binary classification
one (i.e. identifying whether a sample should be considered malware or clean), an a priori and arbitrary
threshold has been set on the number of engines detecting a sample as malware. It is considered that for
a sample i, if the number mi of engines identifying the
sample as malware is such that 0 < mi < 11, then
the sample is discarded. The disadvantage is that these
samples are not considered in the whole methodology
and therefore not classified. Nevertheless, they also have
no influence on the rest of the data set and the final
classification results.

Figure 1. Histogram made using 32683 executable samples and
querying from www.virustotal.com how many anti-virus engines
raise a flag for each sample. Thus for each sample k a number mk
is obtained. For a given value x on the x-axis, the y-axis shows for
how many samples k it is true that mk = x.
This is equivalent to setting a certainty threshold on
the sample analysis, above which a sample can be considered
as indeed malware (and no longer a set of false positives
from mi different engines). Therefore, samples with a
number mi of detecting engines strictly above 10 are
kept and considered as malware (with a relatively high
probability), and samples with 0 detecting engines are
kept and considered as unpredictable (and possibly
clean).

Figure 2 illustrates the pruned set of samples, in which
only samples for which mi = 0 or mi > 10 are kept,
which amounts to 21053 (out of the original set of
32683): 18612 considered as malware, and 2441 as
possibly clean.

It is clear that flagging the 2441 samples for which
mi = 0 as possibly clean is likely to hide a certain
number of false negatives (VirusTotal clearly states that
mi = 0 should in no way be interpreted as meaning
clean). The meta-goal of this paper is to actually
identify such samples, which are potential false negatives,
using a methodology based on the Jaccard similarity
[11], [10] measure and K-Nearest Neighbors classifiers.
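The pruning rule above can be sketched as a small labeling function; this is an illustrative reconstruction (the function name and return values are invented), not code from the paper:

```python
def label_sample(m_i):
    """Apply the paper's pruning rule to a VirusTotal detection count m_i.

    Returns "malware" if more than 10 engines flagged the sample,
    "possibly clean" if no engine flagged it, and None (discarded)
    for the ambiguous band 0 < m_i < 11.
    """
    if m_i == 0:
        return "possibly clean"
    if m_i > 10:
        return "malware"
    return None  # 0 < m_i < 11: discarded from the data set

# Example: prune a toy list of detection counts.
counts = [0, 3, 10, 11, 42]
labels = [label_sample(m) for m in counts]
```

Applied to the full data set, this rule keeps 18612 malware samples and 2441 possibly clean samples out of the 32683 originals, as stated above.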
III. Methodology
The overall process can be summarized by Figure 3,
with the dynamic feature extraction described in the
previous section, followed by the actual methodology
to identify potential false negatives, using a Random
Projection approach and K-Nearest Neighbors classifiers
(described in detail in sections III-C and III-B).

Figure 2. The mk distribution for the samples used for this
histogram is identical to Figure 1, with the important difference
that samples such that 0 < mi < 11 are discarded. Here 2441
samples are depicted that can be considered as clean (mi = 0),
and 18612 samples that can be considered as malicious (mi > 10).

Figure 3. Global schematic of the methodology: a sample is
run through the sandbox to obtain a set of dynamic features; the
random projection approach then reduces the dimensionality of
the problem while retaining most of the information conveyed by
the original feature; finally, a K-Nearest Neighbors classifier in the
random projection space gives a prediction on the studied sample
being malware or not.

A. Measuring Similarity between Executables

In this section, an approach for measuring similarities
between executables is detailed. Let A_i denote the set of
hash values (produced by the sandbox) for file i.
Then, the Jaccard similarity between two
executables i, i' is calculated as

    J_Jaccard^{i,i'} = |A_i ∩ A_{i'}| / |A_i ∪ A_{i'}|.    (1)

Similarly, the cosine similarity J_cosine is given by

    J_cosine^{i,i'} = |A_i ∩ A_{i'}| / sqrt(|A_i| |A_{i'}|).    (2)

Note that the cosine similarity can be expressed as a
scalar product. Denote by

    A = ∪_{i=1}^{N} A_i = {a_1, a_2, ..., a_D},    (3)

where N is the total number of samples and D is the
total number of unique hashes seen in all samples.
Then, from an ordering of the set A, N binary (0/1-valued)
vectors B_i can be constructed, each of D dimensions,
such that

    |A_i ∩ A_{i'}| = ⟨B_i, B_{i'}⟩    (4)

and |A_i| = ||B_i||². Here ||·|| denotes the vector norm and
⟨·,·⟩ the scalar product. Since B_i is a binary vector
(with coordinates 0 and 1 only), ||B_i||² is the number of
coordinates in B_i that are equal to 1.

So, the normalized scalar product of B_i and B_{i'} gives
the cosine similarity:

    J_cosine^{i,i'} = ⟨B_i, B_{i'}⟩ / (||B_i|| ||B_{i'}||).    (5)

Using the relationship between the Euclidean distance

    D_euclidean = ||B_i − B_{i'}||    (6)

and the cosine similarity in the case of ||B_i|| = 1 and
||B_{i'}|| = 1, it appears that

    J_cosine = (2 − D_euclidean²) / 2.    (7)

From Equation 7 it appears that a classification or
clustering based either on the cosine similarity or on the
Euclidean distance will yield the same result if the norm
of the feature vectors is unity.

B. K-Nearest Neighbor Classification

In this section, a standard method (K-NN, see for example [17], [18], [19], [20]) is described; it can be used to
predict whether an unknown executable is malicious or
benign. The essential assumption of the method is that
malicious (resp. clean) executables are surrounded by
malicious (resp. clean) executables in the D-dimensional
Euclidean space spanned by the normalized vectors

    B_i / ||B_i||,    (8)

with B_i the binary vectors defined in the previous section. This means that the more hashes two samples have
in common, the closer they are in this space (assuming
that the number of hashes in the two samples does not
change).

Let us denote the set of k nearest neighbors of sample
i by N_k^i. The classification is based on the data provided
by VirusTotal, that is, how many anti-virus engines have
considered a given executable as malicious. Let us denote
this number by m_i for sample i. In the results section
it is examined how well the m_{i'} of the neighboring samples
N_k^i can actually predict whether the sample i in question is
malicious or clean.

It is important to mention that to predict whether a sample
i is malicious or not, only neighboring samples are used
and not the sample itself. This corresponds to a Leave-One-Out [21], [22], [23], [24] (LOO) classification rate
when it comes to assessing the accuracy of the K-NN
classifier in the Results section. In [21], [22], it is shown
that the Leave-One-Out estimates well the generalization performance of a classifier if the number of samples
is large enough, which is the case in the experiments.

As the dimensionality of B_i is too large, random projections are used in order to reduce this dimensionality
and therefore reduce the needed computational time
and memory by several orders of magnitude. Random
projections are explained in the following section.
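The similarity measures of section III-A translate directly into code. The following sketch (hash sets and names are invented for illustration) computes the Jaccard and cosine similarities of two hash sets, and checks the Equation 7 relation between the cosine similarity and the Euclidean distance of the unit-normalized binary vectors:

```python
import math

def jaccard(A_i, A_j):
    """Jaccard similarity |A_i & A_j| / |A_i | A_j| of two hash sets (Eq. 1)."""
    return len(A_i & A_j) / len(A_i | A_j)

def cosine(A_i, A_j):
    """Cosine similarity |A_i & A_j| / sqrt(|A_i| * |A_j|) (Eq. 2)."""
    return len(A_i & A_j) / math.sqrt(len(A_i) * len(A_j))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Toy hash sets (invented for illustration).
A = {"h1", "h2", "h3"}
B = {"h2", "h3", "h4", "h5"}

# Binary indicator vectors over an ordering of the global hash set (Eq. 3-4).
universe = sorted(A | B)
b_a = [1.0 if h in A else 0.0 for h in universe]
b_b = [1.0 if h in B else 0.0 for h in universe]

# Normalize to unit length, then measure the Euclidean distance (Eq. 6).
na, nb = norm(b_a), norm(b_b)
u_a = [x / na for x in b_a]
u_b = [x / nb for x in b_b]
d_eucl = norm([x - y for x, y in zip(u_a, u_b)])

# Equation 7: J_cosine = (2 - D_euclidean^2) / 2 for unit-norm vectors.
assert abs(cosine(A, B) - (2 - d_eucl ** 2) / 2) < 1e-12
```

In practice the indicator vectors are never materialized over the millions of hashes in A; the sketch only illustrates why Euclidean distance and cosine similarity rank pairs identically once the vectors are normalized.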
C. Random Projections

As mentioned earlier, the cosine similarity is calculated
as

    J_cosine^{i,i'} = ⟨B_i, B_{i'}⟩ / (||B_i|| ||B_{i'}||).    (9)

However, for practical purposes storing the vector B_i is
inconvenient, as it requires too much memory (even if
stored as a sparse vector). The reason for this is that D,
the dimensionality of B_i, is in the range of a few millions.
In order to alleviate this memory (and the related time)
complexity, random projections are used. For the matter
of projecting to a lower-dimensional space, Johnson and
Lindenstrauss [25] have shown that for a set of N points
in a d-dimensional space (using a Euclidean norm), there
exists a linear transformation of the data toward a
d_f-dimensional space, with d_f = O(ε^{-2} log(N)), which
preserves the distances (and hopefully the topology of
the data) up to a 1 ± ε factor. Achlioptas [26] has recently
extended this result and proposed a simpler projection
matrix that preserves the distances to the same factor
as the Johnson-Lindenstrauss theorem mentions, at
the expense of a probability on the distance conservation. For theory and other applications of random
projections in machine learning and classification, see for
example [27], [28], [29], [30], [31].

To describe the random projection approach, let m ∈ A_i,
and

    X_m = [X_{m,1}, X_{m,2}, ..., X_{m,d}],  X_m ~ N(0, I),    (10)

such that X_m ≠ X_{m'} if m ≠ m'; the same hash m is,
however, always assigned the same vector X_m in every file
in which it occurs. N(0, I) represents a d-dimensional
standard normal distribution for which the covariance
matrix is the identity matrix I. Then, for each file i, the
corresponding random projection is the d-dimensional
random vector Y_i defined as

    Y_i = (1 / sqrt(d |A_i|)) Σ_{m ∈ A_i} X_m.    (11)

The scalar product of the random vectors gives the
similarity J, which is a scalar-valued random variable:

    J^{i,i'} = ⟨Y_i, Y_{i'}⟩.    (12)

Using the definition of Y_i, one can see that Pr(J^{i,i} = 1) =
1. Also, if files i and i' do not have any hashes in common,
i.e. A_i ∩ A_{i'} = ∅, then E(J^{i,i'}) = 0.

As an illustrative example, let us calculate the expected
similarity E(J^{i,i'}) by assuming that |A_i| =
|A_{i'}| = l and |A_i ∩ A_{i'}| = k. Note that the cosine
similarity between i and i' in this case is k/l. Also, due
to independence,

    E(⟨X_m, X_{m'}⟩) = 0,  m ≠ m'.    (13)

On the other hand, the following scalar product (in the
case of matching hashes, m = m') has the chi-square
distribution:

    ⟨X_m, X_m⟩ ~ χ²(d),    (14)

where χ²(d) denotes the chi-square distribution with d
degrees of freedom, whose expectation value is d. Since
only the ⟨X_m, X_m⟩ terms contribute to E(J^{i,i'}), it can
be deduced that

    E(J^{i,i'}) = k/l,    (15)

which agrees with the cosine similarity in this case.
Note that in general, if |A_i| ≠ |A_{i'}|, then
E(J^{i,i'}) ≠ J_Jaccard, but still E(J^{i,i'}) = J_cosine. Therefore, the Jaccard index is approximated using the cosine
similarity approach defined previously.
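Equations 10 to 12 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: each hash is mapped to a fixed d-dimensional Gaussian vector (here via a per-hash seeded RNG, an assumption standing in for a shared lookup table), each file is projected with Equation 11, and the scalar product of two projections estimates the cosine similarity k/l of Equation 15:

```python
import math
import random

_cache = {}

def hash_vector(m, d):
    """Fixed d-dimensional Gaussian vector X_m for hash m (Equation 10).
    A per-hash seeded RNG (an assumption of this sketch) guarantees that
    the same hash always maps to the same random vector across files."""
    if (m, d) not in _cache:
        rng = random.Random(m)  # deterministic per hash
        _cache[(m, d)] = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return _cache[(m, d)]

def project(A_i, d):
    """Random projection Y_i = (1/sqrt(d*|A_i|)) * sum_{m in A_i} X_m (Eq. 11)."""
    scale = 1.0 / math.sqrt(d * len(A_i))
    Y = [0.0] * d
    for m in A_i:
        X_m = hash_vector(m, d)
        for j in range(d):
            Y[j] += scale * X_m[j]
    return Y

# With |A_i| = |A_j| = l and |A_i & A_j| = k, E(<Y_i, Y_j>) = k/l (Eq. 15).
d = 4000
A_i = set(range(100))        # l = 100 hashes
A_j = set(range(50, 150))    # shares k = 50 hashes with A_i
J = sum(a * b for a, b in zip(project(A_i, d), project(A_j, d)))
# J falls near k/l = 0.5, up to random-projection noise that shrinks with d
```

The key design point is that the projection never touches the D-dimensional vectors B_i: only the per-hash Gaussian vectors and the d-dimensional sums are ever stored.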

IV. Results

In this section, the Euclidean distance is used in the d-dimensional space spanned by the random projected
representations Y_i of the samples. As noted earlier, the
use of the Euclidean distance instead of the cosine similarity
does not change the results presented in this section, as
Pr(J^{i,i} = 1) = 1. The Y_i are normalized to unity.

A. Accuracy of K-NN Classifier

An illustration of the prediction accuracy of the K-NN method (see section III-B) is shown in Figure 4, and
described in detail in the following.
Let N_10^i be the set of the 10 nearest samples to sample i;
then the prediction of the K-NN method for m_i is the
mean m̄_i of the values {m_{i'} : i' ∈ N_10^i}, expressed as

    m̄_i = |N_10^i|^{-1} Σ_{i' ∈ N_10^i} m_{i'}.    (16)

For a given value x on the x-axis of Figure 4, the height
of the bar on the y-axis shows for how many samples
m̄_i = x, i.e. y(x) = |{i : m̄_i = x}|.

Figure 4. Illustration of the prediction accuracy of the K-NN
method: histogram of the number of detecting engines for the k = 10
nearest neighbors.

The question is how well the numbers of detecting
engines m_i given by VirusTotal compare with their
predicted values m̄_i. In order to answer that question,
the samples are divided into two categories: category 1
as supposedly clean (i.e. m_i = 0) and category 2 as
supposedly malicious (i.e. m_i > 10). They are shown
in Figures 4 and 5. Assuming that m̄_i = 0 means that
sample i is predicted to be clean and that m̄_i > 10 means that
sample i is predicted to be malicious, there would be
a considerable amount of false positives. The number
of false positives can be reduced by introducing a third
class into the K-NN classifier: unpredictable. The next
section details the results obtained using this additional
third class and a modified K-NN.

B. Accuracy of Modified K-NN Classifier

Figure 5 shows the prediction accuracy of the modified
K-NN classifier. Now, the K-NN classifier has 3 classes:
predicted to be clean, predicted to be malicious, and unpredictable.

A sample i is classified as clean if m̄_i = 0. It is
classified as malicious if m_{i'} > 10 for all i' ∈ N_10^i, i.e. if
all the 10 nearest neighbors N_10^i of i are supposedly
malicious. A neighboring sample i' is considered supposedly malicious if m_{i'} > 10, i.e. if it has been flagged
as malicious by more than 10 AV engines. Furthermore,
a sample i is considered to be unpredictable if it does
not fulfill the requirement to be classified as clean or
malicious. In the production of the histogram depicted
in Figure 5, samples that are unpredictable are omitted.
In Figure 5, the concepts of false negative, false positive,
true positive and true negative are illustrated.

Figure 5. Prediction accuracy of the modified K-NN classifier.

Introducing the unpredictable class considerably improves the prediction accuracy for the two other classes.
This improvement is due to the fact that the uncertainty
on the neighbors is used to separate the predictable and
unpredictable samples. An unpredictable sample is a
sample i such that its neighbors are neither all
supposedly malicious (i.e. m_{i'} > 10) nor all supposedly
clean (i.e. m_{i'} = 0).
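The three-class decision rule described above can be sketched as a small function; this is a minimal illustration (function name and toy inputs are invented), operating on the VirusTotal detection counts of a sample's k nearest neighbors:

```python
def classify(m_neighbors):
    """Modified K-NN decision rule: a sample is 'clean' if all of its k
    nearest neighbors have m = 0, 'malicious' if all of them have m > 10,
    and 'unpredictable' otherwise. m_neighbors holds the detection counts
    of the neighbors only, never the sample's own count (leave-one-out)."""
    if all(m == 0 for m in m_neighbors):
        return "clean"
    if all(m > 10 for m in m_neighbors):
        return "malicious"
    return "unpredictable"

print(classify([0, 0, 0]))      # all neighbors undetected -> clean
print(classify([15, 22, 40]))   # all neighbors heavily flagged -> malicious
print(classify([0, 12, 30]))    # mixed neighborhood -> unpredictable
```

Note that "clean if m̄_i = 0" and "clean if every neighbor has m_{i'} = 0" are the same condition, since the detection counts are non-negative.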


Figure 6. The entries of the confusion matrix (false positive, false


negative, true positive and true negative) are plotted in this figure
as a function of the parameter k, the number of nearest neighbors.
In addition, the number of unpredictable samples is represented.

C. Influence of the Number of Nearest Neighbors in the
Modified K-NN Classifier on the Confusion Matrix

In Figure 5 are illustrated the notions of false positive, false negative, true positive and true negative. A
prediction for a sample i is considered to be a false
positive if m_{i'} > 10 for all i' ∈ N_k^i and m_i = 0 are true
at the same time. This means that all the k nearest
neighbors N_k^i of sample i are supposedly malicious
(m_{i'} > 10 for all i' ∈ N_k^i); however, sample i itself is
considered to be supposedly clean (m_i = 0). Similarly,
a true positive means that m_{i'} > 10 for all i' ∈ N_k^i and
m_i > 10 are true for sample i. Furthermore, false
negatives are characterized by m_{i'} = 0 for all i' ∈ N_k^i and
m_i > 10, while a true negative is a sample i for which
m_{i'} = 0 for all i' ∈ N_k^i and m_i = 0 holds.

The entries of the confusion matrix (false positive,
false negative, true positive and true negative) are plotted in Figure 6 as a function of the parameter k, the
number of nearest neighbors. Sample i is unpredictable
if neither m_{i'} = 0 for all i' ∈ N_k^i nor m_{i'} > 10 for all i' ∈ N_k^i is
true. The number of unpredictable samples increases
monotonically with increasing k; this must be so, as
increasing k by one introduces an additional condition
that has to be fulfilled in order for a sample to be
classified as predictable. In fact, if a sample is labeled
as unpredictable for k, it cannot become predictable
for k + l, l > 0.
In Figure 6, one can note that the number of false
and true negatives stops decaying at k = 40. However,
at k = 40 the number of true and false positives is still
decaying at a rapid rate. The reason for this difference
might be that there are far fewer supposedly clean
samples than supposedly malicious ones. Also, the
cluster size distribution might be different for these two
categories, which could manifest itself in the different
decay behaviors in Figure 6.

Figure 6 can be used to choose the parameter k that
fits the needs of the user of the modified K-NN method.
Furthermore, note the difference in the decay exponents
for the true and false positive rates. If k is increased from 2 to
100, the number of true positives decreases from 17150 to
6204, while the number of false positives decreases from
531 to 17. The decrease in true positives is 64% while
the decrease in false positives is 97%. So, if one wants
to increase the true positive/false positive ratio, it is
advisable to increase the number of neighbors k. On the
other hand, one should not forget that by increasing k one
also increases the number of unpredictable samples. In
order to limit this number of unpredictable samples, the
number of nearest neighbors k to use has been chosen
as 11 for the final detection of the false negatives.
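The monotonicity argument can be checked on toy data: growing k only adds conditions a neighborhood must satisfy, so a sample that is unpredictable at some k stays unpredictable for all larger k. A small sketch (the neighbor detection counts are invented for illustration):

```python
def unpredictable_count(neighbor_counts, k):
    """Count samples that are 'unpredictable' given the detection counts
    of their k nearest neighbors (each row is pre-sorted by distance)."""
    n = 0
    for counts in neighbor_counts:
        nn = counts[:k]
        if not (all(m == 0 for m in nn) or all(m > 10 for m in nn)):
            n += 1
    return n

# Toy data: each row lists the detection counts m of one sample's
# neighbors, ordered from nearest to farthest (values are made up).
neighbor_counts = [
    [0, 0, 0, 14, 0],      # clean up to k = 3, unpredictable from k = 4
    [20, 31, 12, 11, 25],  # malicious for every k
    [0, 13, 0, 0, 0],      # unpredictable from k = 2 onwards
]
counts_by_k = [unpredictable_count(neighbor_counts, k) for k in range(1, 6)]
# The sequence is non-decreasing in k: once a mixed neighbor appears
# within the first k neighbors, it stays within the first k + l neighbors.
assert all(a <= b for a, b in zip(counts_by_k, counts_by_k[1:]))
```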
D. Influence of the Random Projection Dimension on
the Confusion Matrix
In the previous section, the dependency of the confusion matrix with respect to the number of neighbors
is discussed and the dimension of the random projected
vectors is fixed to be d = 300. In this section, the effects

of varying d on the confusion matrix are investigated.


In order to have a very small number of false negatives
and to demonstrate the influence of d, the number of
neighbors k is chosen to be 30 in this section. Figure
7 shows the dependency of the confusion matrix on
the number of dimensions d of the projected vectors.
Clearly, increasing d improves the results: the number of
unpredictable samples decreases while the true positives
increase and the false positives decrease.
The true and false negatives do not change much with
increasing d. This might be related to the fact that at
k = 30 the decay of the true and false negatives in Figure 6
has almost completely stopped. So, even though the low
value of d = 300 might mean that the distances in the
d = 300-dimensional Euclidean random projected space
are noisy compared to the D > 10^6-dimensional original
space, the samples that are true and false negatives are
insensitive to this noise.
Figure 7 indicates that convergence in all confusion
matrix elements can be reached by using d = 700. By
increasing d even more, no significant improvement is
observed.
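For intuition on how the projection dimension scales, the Johnson-Lindenstrauss bound quoted in section III-C can be evaluated numerically. An explicit form of the bound is used below (an assumption of this sketch, since the text only gives the O(ε^{-2} log N) order; the constant follows the form used in common implementations such as scikit-learn's johnson_lindenstrauss_min_dim):

```python
import math

def jl_min_dim(n_samples, eps):
    """Minimum projection dimension d_f preserving pairwise distances to a
    1 +/- eps factor for n_samples points, via the explicit bound
    d_f >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4 * math.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3))

# For the 21053 samples of the paper, guaranteeing 10% distortion for
# every pair already requires several thousand dimensions:
d_10pct = jl_min_dim(21053, 0.1)
d_20pct = jl_min_dim(21053, 0.2)
```

The worst-case bound is deliberately conservative; the empirical convergence at d = 700 reported above only requires that the confusion-matrix entries stabilize, not that every pairwise distance be preserved within a tight factor.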
The necessity of using the random projection method is
almost unavoidable: if one would like to use the original
space (with dimensionality D > 10^6), the complexity of
the problem (in terms of memory and computational
time) can become an issue, as D has been as high as
5·10^8 in other related experiments. In this situation, if
one wishes to calculate distances between vectors in the
original space, then all the data needs to be located in
memory (since the original space is spanned by all
the hashes produced by the sandbox). Furthermore, here
a set of samples of cardinality of the order of 10^4 has
been considered. However, future experiments will be on
the scale of 10^6 samples, where using the original space
might become prohibitive.
The total computational time needed to run the
methodology on the 21053 samples is a few hours using
a Python implementation of the random projections and
K-NN. In comparison, without the random projection
approach, the computational time is estimated at
a few weeks, due to the dimensionality of Bi.
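The dimensionality reduction step itself is simple to sketch. The following is an illustrative example of a sparse random projection in the spirit of Achlioptas [26] and Vempala [31], not the authors' implementation; the sizes D, d and n used here are toy values, far below the D > 10^6 of the actual experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5000, 300, 50   # toy sizes; the paper uses D > 10^6 and d up to 700

# Achlioptas-style sparse projection matrix: entries +1 or -1 with
# probability 1/6 each and 0 with probability 2/3, scaled by sqrt(3/d) so
# that Euclidean distances are preserved in expectation
# (Johnson-Lindenstrauss lemma).
R = rng.choice([1.0, 0.0, -1.0], size=(D, d), p=[1/6, 2/3, 1/6]) * np.sqrt(3.0 / d)

X = rng.random((n, D))    # n samples in the original D-dimensional space
X_proj = X @ R            # the same n samples in the d-dimensional space

# K-NN can now run on X_proj, since pairwise distances are approximately
# preserved by the projection at a fraction of the memory and time cost.
```

With d = 300 the pairwise distances in the projected space typically deviate from the original ones by only a few percent, which is consistent with the convergence behavior reported for the confusion matrix above.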
Finally, based on these results, one might improve the
previously presented random projection method by using
a different number of dimensions d for each pairwise
distance calculated: larger distances could be treated with
less accuracy (lower d) and smaller distances with
better accuracy (higher d). This is a possible direction
for future research.
E. Manual Analysis by a Human Expert and Further
Work
Figure 7. Dependency of the elements of the confusion matrix with respect to the number of dimensions d of the projected vectors (panels: true positives, false positives, true negatives, false negatives and unpredictable samples, for d from 0 to 700).

Using a projection dimension of d = 100 and a modified K-NN with k = 11, 24 false negatives have been extracted out of the 2441 possibly clean files. This reduced
number allows the manual analysis by a human expert.
According to an F-Secure Corporation expert, 25% of
these 24 files are surely malicious. 50% have a relatively
high probability to be also malicious. The remaining 25%
are considered as clean by the expert.
Even with such a reduced number of candidates,
a human analysis is taking time and has high costs
(especially if the 50% of unsure samples have to be
further investigated). This shows the usefulness of the
presented methodology since it would be impossible to
find enough highly qualified experts to analyze the initial
2441 possibly clean files.
The same methodology will be applied in the future
using a labeling different from the one provided by VirusTotal. Also, different dynamic features will be investigated and eventually combined with some static features
(code signatures, packer information...), and possibly
other types of malware will be included in the sample set.
V. Conclusion
In this paper, a robust two-stage methodology has
been introduced in order to both perform classification
of executable files and detect the files with the highest
probability of being false negatives (malware that are
labeled as possibly clean files). It has been shown that
the methodology is not only accurate but also reduces
the computational time by several orders of magnitude.
This makes the proposed methodology a valid candidate
as a pre-processing tool to provide inputs to forensic
experts in order to detect malware that has not yet
been detected by the AV engines used in VirusTotal.
Furthermore, this methodology can also be applied to
other labelings. Also, new and different dynamic features
will be investigated and combined with static features
(code signatures, packer information...) extracted from
the samples before sandbox execution. This will be the
natural continuation of the presented work.

Acknowledgments

The authors of this paper would like to acknowledge F-Secure Corporation for providing the data and software
required to perform this research. Special thanks go to
Pekka Orponen (Head of the ICS Department, Aalto
University), Alexey Kirichenko (Research Collaboration
Manager F-Secure) and Daavid Hentunen (Researcher
F-Secure) for their valuable support and many useful
comments. This work was supported by TEKES as part
of the Future Internet Programme of TIVIT. Part of the
work of Amaury Lendasse and Alexander Ilin is funded
by the Adaptive Informatics Research Centre, Centre of
Excellence of the Finnish Academy.
References
[1] Y. Liu, L. Zhang, J. Liang, S. Qu, and Z. Ni, "Detecting trojan horses based on system behavior using machine learning method," in Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, vol. 2, July 2010, pp. 855–860.
[2] I. Firdausi, C. Lim, A. Erwin, and A. Nugroho, "Analysis of machine learning techniques used in behavior-based malware detection," in Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on, December 2010, pp. 201–203.
[3] E. Menahem, A. Shabtai, L. Rokach, and Y. Elovici, "Improving malware detection by applying multi-inducer ensemble," Computational Statistics & Data Analysis, vol. 53, no. 4, pp. 1483–1494, 2009.
[4] L. Sun, S. Versteeg, S. Boztaş, and T. Yann, "Pattern recognition techniques for the classification of malware packers," in Information Security and Privacy, ser. Lecture Notes in Computer Science, R. Steinfeld and P. Hawkes, Eds. Springer Berlin / Heidelberg, 2010, vol. 6168, pp. 370–390.
[5] J. Kinable and O. Kostakis, "Malware classification based on call graph clustering," Journal in Computer Virology, pp. 1–13, 2011.
[6] A. Srivastava and J. Giffin, "Automatic discovery of parasitic malware," in Recent Advances in Intrusion Detection (RAID'10), ser. Lecture Notes in Computer Science, S. Jha, R. Sommer, and C. Kreibich, Eds. Springer Berlin / Heidelberg, 2010, vol. 6307, pp. 97–117.
[7] C. Willems, T. Holz, and F. Freiling, "Toward automated dynamic malware analysis using CWSandbox," IEEE Security and Privacy, vol. 5, pp. 32–39, March 2007.
[8] K. Yoshioka, Y. Hosobuchi, T. Orii, and T. Matsumoto, "Vulnerability in public malware sandbox analysis systems," in Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, ser. SAINT '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 265–268.
[9] M. Bailey, J. Andersen, Z. Morley Mao, and F. Jahanian, "Automated classification and analysis of internet malware," in Recent Advances in Intrusion Detection (RAID'07), 2007.
[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.
[11] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.
[12] Hispasec Sistemas, "VirusTotal analysis tool," 2011, http://www.virustotal.com.
[13] Norman ASA, "Norman launches sandbox sdk," April 2009, http://www.norman.com/about_norman/press_center/news_archive/2009/67431/en.
[14] F-Secure Corporation, "F-Secure DeepGuard: a proactive response to the evolving threat scenario," November 2006, http://www.rp-net.com/online/filelink/340/20061106%20F-secure_deepguard_whitepaper.pdf.
[15] ——, "F-Secure DeepGuard 2.0 - white paper," September 2008, http://www.f-secure.com/system/fsgalleries/white-papers/f-secure_deepguard_2.0_whitepaper.pdf.
[16] ——, "Information about System Control and DeepGuard," January 2011, http://www.f-secure.com/kb/2034.
[17] D. Aha and D. Kibler, "Instance-based learning algorithms," in Machine Learning, 1991, pp. 37–66.
[18] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in Int. Conf. on Database Theory, 1999, pp. 217–235.
[19] C. Bishop, Neural Networks for Pattern Recognition, 1st ed. Oxford University Press, USA, Jan. 1996.
[20] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.
[21] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.
[22] ——, "Improvements on cross-validation: The .632+ bootstrap method," Journal of the American Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
[23] A. Lendasse, V. Wertz, and M. Verleysen, "Model selection with cross-validations and bootstraps - application to time series prediction with RBFN models," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2714, pp. 573–580, 2003.
[24] Q. Yu, Y. Miche, A. Sorjamaa, A. Guillén, A. Lendasse, and E. Séverin, "OP-KNN: Method and applications," Advances in Artificial Neural Systems, vol. 2010, no. 597373, February 2010, 6 pages.
[25] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," in Conference in Modern Analysis and Probability, New Haven, USA, 1982, pp. 189–206.
[26] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671–687, 2003.
[27] S. Dasgupta, "Experiments with random projection," in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, ser. UAI '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 143–151.
[28] X. Fern and C. Brodley, "Random projection for high dimensional data clustering: A cluster ensemble approach," in International Conference on Machine Learning (ICML'03), 2003, pp. 186–193.
[29] D. Fradkin and D. Madigan, "Experiments with random projections for machine learning," in KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2003, pp. 517–522.
[30] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, "OP-ELM: Optimally-pruned extreme learning machine," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158–162, January 2010.
[31] S. Vempala, The Random Projection Method, ser. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2005, vol. 65.

A Two-Stage Methodology using K-NN and False
Positive Minimizing ELM for Nominal Data
Classification

Yoan Miche 1, Anton Akusok 1, Jozsef Hegedus 1, Rui Nian 4 and Amaury Lendasse 1,2,3

1 Department of Information and Computer Science, Aalto University, FI-00076 Aalto, Finland
2 IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
3 Computational Intelligence Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel Lardizabal 1, Donostia/San Sebastián, Spain
4 College of Information and Engineering, Ocean University of China, Qingdao, 266003 China

Abstract
In this paper, a methodology for performing binary classification on nominal
data under specific constraints is proposed. The goal is to classify as many
samples as possible while avoiding False Positives at all costs, all within the
smallest possible computational time. Under such constraints, a fast way
of calculating pairwise distances between the nominal data available for all
samples is proposed. A two-stage decision methodology using two types of
classifiers then provides a fast means of obtaining a classification decision on
a sample, keeping False Positives as low as possible while classifying as many
samples as possible (high coverage). The methodology only has two parameters, which respectively set the precision of the distance approximation and
the final tradeoff between False Positive rate and coverage. Experimental
results using a specific data set provided by F-Secure Corporation show that
this methodology provides a rapid decision on new samples, with a direct
control over the False Positives.

Email address: {yoan.miche,anton.akusok,jozsef.hegedus,amaury.lendasse}@aalto.fi, [email protected]

Preprint submitted to Elsevier, July 4, 2012
1. Introduction
Classification problems relying solely on the distances between the different samples are common in genetics [1], or in syntactic and document resemblance problems [2, 3]. The reason for the direct use of the distance matrix in
these setups is that the original data does not lie in a Euclidean space, but is
usually nominal data, i.e. without any sense of ordering between two different
values. As such, distance matrices usually need to be calculated using non-Euclidean
metrics.
This paper is concerned with the problem of binary classification for
such nominal data, under certain specific constraints: zero False
Positives, high coverage and small computational time.
While the high coverage constraint is rather typical (achieving the highest True Positive and True Negative rates possible), the zero False Positive
constraint is not. In addition, the False Negatives are not regarded as very
important in this problem setup: even if lowering False Negatives means increasing the coverage, the most highly regarded requirement is on the False
Positives.
As mentioned, the fact that the data is nominal makes it mandatory to
use methods which directly deal with the distance matrix. A means of computing this distance matrix is first described, by the use of an approximation
technique based on Min-Wise independent hash function families.
The following Section 2 describes a very specific application of this proposed methodology to Malware detection for computer security. This application is exactly framed by the previously mentioned constraints. In addition, this application provides experimental data on which the proposed
methodology is tested in Section 4.
Section 3 first describes the matter of calculating distances between samples, and then how the use of the Jaccard distance remains possible under
the low-computational-time imperative, by estimating it using Locality Sensitive Hashing. A 1-Nearest Neighbor classifier is then proposed as a first
step and its shortcomings listed, while Section 4 details the complete two-step methodology which addresses these issues, along with the experimental
results.

2. A Specific Application
The goal of Anomaly Detection in the context of computer intrusion detection [4] is to identify abnormal behavior, defined as behavior deviating from what
is considered normal, and to signal the anomaly in order to take appropriate measures: identification of the anomaly source, shutdown or closing
of sensitive information or software...
Most current anomaly detection systems rely on sets of heuristics or rules
to identify this abnormality. Such rules and heuristics enable some flexibility
on the detection of new anomalies, but still require action from the expert to
tune the rules according to the new situation and the potential new anomalies identified. One ideal goal is then to have a global system capable of
learning what constitutes normal and abnormal behavior, and therefore
able to reliably identify new anomalies [5, 6]. In such a context, the only
human interaction required is the monitoring of the system, to ensure that
the learning phase happened properly.
A small part of the whole anomaly detection problem is studied in this
paper, in the form of a binary classification problem for malware and clean
samples. While the output of this problem is quite typical, the input is not.
In order to compare files together and compute a similarity between them, a
set of features is needed. F-Secure Corporation devised such a set of features
[7], based partly on sandbox execution (virtual environment for a sample
execution [8, 9]). This sandbox is capable of providing a wide variety of
behavioral information (events), which as a whole can be divided into two
main categories: hardware-specific or OS-specific. The hardware-specific information is related to the low-level, mostly CPU-specific, events occurring
during execution of the application being analyzed in the virtual environment (up to the CPU instruction flow tracing). The other category mostly
relates to the events caused by interaction of the application with the virtual OS (the sandbox). This category includes information such as General
Thread/Process events (e.g. Start/Stop/Switching), API call events, specific
events like Structured Exception Handling, system module loading, etc. Besides, the sandbox can provide (upon user request) some other information
about application execution, such as reaching pre-set breakpoints or detecting behavioral patterns that are not typical of traditional well-written
benign applications (e.g., so-called anti-emulation and anti-debugging tricks).
The sandbox features used in the following research are thus the dynamic component of the collected features.

Figure 1: Feature extraction from a file (sample): The sandbox runs the sample in a
virtual environment and extracts dynamic (run-time specific) information; meanwhile a
set of static features are extracted and both sets are combined in the whole feature set.

Dynamic features in this context refer to


those gathered from the Sandbox while an inspected application was executed
in it. Some examples are which API calls were made and with
what parameters, and various types of memory and code fingerprints. Static
features refer to some of the features gathered from the executable binary
itself without actually executing it. Some examples are which packer
it was compressed with and various code and data fingerprints. There are
15 features from the static domain and as many from the dynamic domain,
containing up to tens of thousands of values each. Each of these features
can be present or absent for one sample (e.g. if the sample studied does not
perform some classical operations in the sandbox, some features do not get
activated). As such, the input data obtained per sample usually consists of
tens of thousands of values for each feature number. The feature values are
represented by CRC64 hashes.
One of the major challenges is related to this data size: each sample
having some tens of thousands (on average) of feature-value pairs (at most
30 features per sample, with thousands of values per feature for one sample), sample-to-sample comparisons are computationally non-trivial. Also, due to the nature of the data, measuring similarities between files
requires specific metrics that can be applied to nominal data (i.e. with no
sense of order between values, as opposed to ordinal data). Indeed, since the
actual feature values are encoded as hashes (and represent function strings
and series of arguments, parameters. . . ), classical measures used in Euclidean
spaces do not apply. The Jaccard similarity enables such comparisons and is

                          Prediction
  Actual       Malware                Clean
  Malware      True Positive (TP)     False Negative (FN)
  Clean        False Positive (FP)    True Negative (TN)

Table 1: Confusion Matrix for this binary classification problem.

detailed in Section 3, with the computational challenges it poses.


In addition to this specificity of the data, the requirements on the performance of the classifier are particular as well. As a security company, F-Secure
Corporation needs very low false positives in any deployed anomaly detection
system: if a clean file is labeled as malware (i.e. is a false positive),
it is likely that several clients will see this same error deployed on their machines as well. This single mistake will potentially hinder the work
on all the affected machines seriously, making the clients unhappy about the product
and thus deactivating it or switching to a competing one. Therefore, while
typical binary classification problems addressed by machine learning focus
on optimizing the accuracy, one of the goals of the methodology presented
in this paper is to lower the false positives to zero. To clarify notations,
Table 1 summarizes the confusion matrix used in this paper.
An additional practical constraint also makes this problem particular.
Since the goal is the identification and classification of new malware samples,
there is an imperative on the time it takes to reach a decision per sample:
the faster an answer is provided, the quicker the deployment of the
information concerning a new sample, possibly preventing infection at many
other sites. As such, computational times need to be reduced as much as
possible.
3. Problem Description
This section first describes the problem in terms of the nature of the
data at hand, and a way to calculate distances between files using this very
data. The matter of the computational requirements for such calculations is
addressed by an approximation based on Min-Wise independent families of
hash functions. The parameters of this approximation are then determined
and its effects investigated.

3.1. Data Specifics and Distance Calculation


3.1.1. Data Specifics
Distances in the traditional Euclidean sense are usually calculated for points
whose coordinates locate them in the space. A data set consisting of
multiple hashes, with different hashes representing incomparable properties
or attributes, is effectively categorical and does not allow distances to be
calculated in the classical manner. The specifics and origin of the data
set used in this paper are confidential as the data is provided by F-Secure
Corporation. Original values present in the data have been hashed using the
CRC64 hash function, so as to obfuscate the original details.
The data set is composed of a large amount of files (samples), each having
the following structure:
- 30 possible feature numbers (each representing a different class of information recorded about the sample)
- For each of these feature numbers, a variable amount of hashes (from
0 to tens of thousands).
The reason for this structure is that some feature numbers stand for
a wide range of possible information: if one such feature number stands,
e.g., for the names of all the functions called in a sample, the number of
values associated to it is bound to be large for some samples. It is important
to note that the number of feature values per feature number can be very
different from file to file.
With this data structure, it is impossible to use traditional Machine
Learning techniques directly, as most of them rely on the data points' positions in
the sample space (usually expected to be Euclidean). In this paper, distances between samples are calculated using the Jaccard index [10, 11],
as presented in the next subsection.
3.1.2. Distance Calculation for Nominal Data
One of the most classical similarity statistics for nominal data is the
Jaccard index [10]. It enables the computation of the similarity between
two sets of nominal attributes as the ratio between the cardinalities of their
intersection and of their union. Denoting A and B two sets of nominal
attributes, the Jaccard index is defined as

    J(A, B) = |A ∩ B| / |A ∪ B|.    (1)

This index intuitively gives a good sense of overlap (similarity) between
the two sets; the more common attributes (hashes in this case) they have,
the more static and dynamic properties the corresponding files (each
associated with one set) share, and thus the higher the chance that they are
of the same class. In addition, considering the Jaccard distance J′(A, B) =
1 − J(A, B) yields an actual metric, which enables the use of Machine
Learning techniques directly.
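As a minimal illustration (the hash values below are hypothetical, not drawn from the actual F-Secure data), the index of Eq. 1 and the associated distance can be computed directly on Python sets:

```python
def jaccard_index(a, b):
    """Jaccard index J(A, B) = |A intersect B| / |A union B| of two hash sets."""
    if not a and not b:
        return 1.0  # convention of this sketch: two empty sets are identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance J'(A, B) = 1 - J(A, B); this is an actual metric."""
    return 1.0 - jaccard_index(a, b)

# Hypothetical CRC64-like hash values for two files
A = {0x1A2B, 0x3C4D, 0x5E6F, 0x7081}
B = {0x1A2B, 0x3C4D, 0x9AAB}

print(jaccard_index(A, B))     # 2 common hashes out of 5 distinct -> 0.4
print(jaccard_distance(A, B))  # 0.6
```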
In the case of this paper, the files not only have one set of attributes
but multiple, identified by their feature number. As such, let us redefine
A = {A_i}_{i ∈ F_A}, where A_i is the set of hashes associated to feature number
i, and F_A is the set of all feature numbers available for file A. Therefore,
the Jaccard index needs to take into account all such feature numbers. A
straightforward modification of the Jaccard index for this case is to define it
as

    J(A, B) = (1 / |C|) Σ_{i ∈ C} |A_i ∩ B_i| / (|A_i| + |B_i| − |A_i ∩ B_i|),    (2)

where A_i and B_i are the sets of feature values for feature number i for
file A and B respectively, and C = F_A ∩ F_B, with F_A (resp. F_B) the set of the
feature numbers for file A (resp. B).
This way, only feature numbers present in both files are accounted for. In
addition, expressing the index like this avoids computing the cardinality of the union, which saves some computational time, as the cardinalities
of the sets A_i and B_i are known.
The computational time required for the multiple calculations of the Jaccard distance remains a problem, due to the intersection cardinality calculation. This problem is addressed in the following subsection by approximating
the Jaccard distance.
3.2. Speeding up the distance calculations
The main drawback of the original Jaccard distance lies in the computational time required for its calculation. While the intersection of two sets
(the upper part of the fraction in Eq. 2) is relatively fast (for example, the Python implementation has an average complexity of
O(min{|A_i|, |B_i|}) and a worst case of O(|A_i| · |B_i|) [12]), the intersection
of such large sets repeated multiple times makes the total computational time
intractable. As mentioned before, the sets A_i for one single feature number
i can total some tens of thousands of elements.

As such, the direct Jaccard distance calculations using Eq. 2 cannot be
used. The specific requirement of near real-time computations for this problem raises the need for a fast approximation of the Jaccard distance.
3.2.1. Resemblance as an alternative to Jaccard index
Consider a file named A, and denote by |A| the number of hashes in this
file (to avoid heavy notations, it is considered that only one feature number
is present in the files; the following extends directly to the practical case of
multiple feature numbers per file). Let us define by S(A, l) the set of all
contiguous subsequences of length l of hashes of A. Using these notations,
one can define [3] the resemblance r_l(A, B) of two files A and B based on
their hashes as

    r_l(A, B) = |S(A, l) ∩ S(B, l)| / |S(A, l) ∪ S(B, l)|,    (3)

which is similar to the original definition of the Jaccard index. Defining the
resemblance distance as

    d_l(A, B) = 1 − r_l(A, B)    (4)

yields an actual metric [3, 2].
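A minimal sketch of S(A, l) and of the resemblance of Eq. 3, using toy hash sequences rather than real CRC64 output, could be:

```python
def shingles(hashes, l):
    """S(A, l): the set of contiguous subsequences (shingles) of length l
    of the hash sequence of a file."""
    return {tuple(hashes[i:i + l]) for i in range(len(hashes) - l + 1)}

def resemblance(hashes_a, hashes_b, l):
    """r_l(A, B) of Eq. 3: Jaccard index of the two shingle sets.
    The resemblance distance of Eq. 4 is then 1 - r_l(A, B)."""
    sa, sb = shingles(hashes_a, l), shingles(hashes_b, l)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical hash sequences of two files
A = [1, 2, 3, 4, 5]
B = [1, 2, 3, 9, 5]
print(resemblance(A, B, 2))  # shares shingles (1,2) and (2,3): 2/6 = 1/3
```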


Let us fix the size l of the contiguous subsequences of hashes and denote
by Ω_l the set of all such subsequences of length l. Let us assume that Ω_l
is totally ordered, and fix a number of elements n. For any subset ω ⊆ Ω_l,
denote by MIN_n(ω) the set of the smallest n elements of ω (using the order
on Ω_l), defined as

    MIN_n(ω) = { the set of the smallest n elements of ω,  if |ω| ≥ n;
                 ω,                                         otherwise.    (5)
From [3], the following theorem gives an unbiased estimate of the resemblance r_l(A, B).

Theorem 1. Let π : Ω_l → Ω_l be a permutation on Ω_l chosen uniformly at
random, and let M(A) = MIN_n(π(S(A, l))). Defining M(B) similarly, the
following is an unbiased estimate of r_l(A, B):

    r̂_l(A, B) = |MIN_n(M(A) ∪ M(B)) ∩ M(A) ∩ M(B)| / |MIN_n(M(A) ∪ M(B))|.

The proof can be found in [3].


As such, once a random permutation π is chosen, it is possible to use only
the set M(A) (instead of the whole of A) for resemblance-based calculations.

3.2.2. Weak Universal Hashing and Min-Wise Independent Families


Note that while CRC64 cannot be considered as a random hash function,
the notion of weak universality for a family of hash functions proposed in
[13] makes it possible to further extend the former approximation to families
of hash functions satisfying

    Pr(h(s1) = h(s2)) ≤ 1/M,    (6)

with h a hash function chosen uniformly at random from the family H of
functions U → M, s1 and s2 elements of the origin space U of the hash
functions in H, and M = |M|. More precisely, in [14], the definition of min-wise
independent family of functions is proposed in the spirit of the weak
universality concept, and the authors show that for such families of functions,
the resemblance can be computed directly.
Define as min-wise independent a family H of functions such that for any
set X ⊆ ⟦1, N⟧ and any x ∈ X, when the function h is chosen at random in
H, we have

    Pr(min{h(X)} = h(x)) = 1/|X|.    (7)

That is, all elements of the set X must have the same probability of becoming
the minimum element of the image of X under the function h. Assuming
such a min-wise independent family H, then

    Pr(min{h(S(A, l))} = min{h(S(B, l))}) = r_l(A, B),    (8)

for files A and B and a function h chosen uniformly at random from H; it is
therefore possible to compute the resemblance r_l(A, B) of files A and B by
computing the cardinality of the intersection

    {min(h_1(S(A, l))), ..., min(h_k(S(A, l)))} ∩ {min(h_1(S(B, l))), ..., min(h_k(S(B, l)))},    (9)

where h_1, ..., h_k are a set of k independent random functions from H. This
way of calculating the resemblance of two files is sometimes called min-hash,
and this name is used in the rest of this paper to denote this approach.
For computational and practical reasons, in this paper only one hash
function is used (CRC64), and the cardinality of the intersection of Eq. 9
is approximated as the cardinality of

    {min_k(h(S(A, l)))} ∩ {min_k(h(S(B, l)))},    (10)

where the notation min_k(X) denotes the set of the k smallest elements in X
(assuming X is fully ordered). While this is a crude approximation, experiments show that convergence with respect to k towards the true value
of the resemblance is assured, as shown in the following subsection.
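The single-hash-function approximation of Eq. 10 can be sketched as follows. Two assumptions of this sketch are not fixed by the paper: Python's built-in integer ordering stands in for the ordering induced by CRC64 (which is reasonable only because hash values behave like uniformly random numbers), and the signature overlap is normalised by k to obtain a resemblance estimate:

```python
import random

def mink_signature(hashes, k):
    """min_k(h(S(A, l))): keep only the k smallest hash values of a sample.
    Assumes the values behave like uniform random numbers (as CRC64
    outputs approximately do)."""
    return set(sorted(hashes)[:k])

def minhash_resemblance(sig_a, sig_b, k):
    """Crude resemblance estimate from two k-smallest signatures (Eq. 10);
    dividing the overlap by k is this sketch's normalisation choice."""
    return len(sig_a & sig_b) / k

# Two overlapping sets of pseudo-random "hash values"
random.seed(1)
A = set(random.sample(range(10**9), 5000))
B = set(random.sample(sorted(A), 4000)) | set(random.sample(range(10**9), 1000))

exact = len(A & B) / len(A | B)             # exact Jaccard index, for comparison
k = 500
estimate = minhash_resemblance(mink_signature(A, k), mink_signature(B, k), k)
```

As the paper's Figure 2 discussion indicates, the estimate is coarse and biased for small k, and only stabilises once k reaches the thousands of hashes.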
3.2.3. Influence of the number of hashes on the proposed min-hash approximation
Figure 2 illustrates experimentally the validity of the proposed approximation of the Jaccard distance by the min-hash based resemblance. These
plots use a small subset of 3000 samples from the whole dataset, used only
for the purpose of validating the number of hashes k required for a proper
approximation.
As can be seen, with low numbers of hashes, such as k = 10 or 100
(subfigures (a) and (b)), quantization effects appear in the estimation of the
resemblance, and the estimation errors are large. These quantization problems are especially important with regard to the method using these distances
(K-Nearest Neighbors), as presented in the next section: since distances
are so strongly quantized, samples lying at different distances appear to lie at
the same one, and can thus wrongly be taken as nearest neighbors.
The quantization effects are lessened when k reaches the hundreds of
hashes, as in subfigure (c), while the errors on the estimation remain large.
Using k = 2000 hashes reduces such errors to only the largest distances, which are
of less importance for the following methodology. While k = 10000 hashes
reduces these errors further (and even more so for larger values of k), the
main reason for using the described min-hash approximation is to drastically
reduce the computational time.
Figure 3 is a plot of the average time required per sample for the determination of the distances to the whole reference set, with respect to the
number of hashes k used for the min-hash. Thanks to the use of the Apache
Cassandra backend (with three nodes) for these calculations¹, the computational time only grows linearly with the number of hashes (and also linearly
with the number of samples in the reference set, although this is not depicted
here). Unfortunately, for large values of k the computational time is not
sufficiently low for the practical application of this methodology. Therefore,
¹ Details of the implementation are not given in this paper, but can be found in
the publications and deliverables of the Finnish ICT SHOK Programme Future Internet:
http://www.futureinternet.fi

(a) k = 10 hashes   (b) k = 100 hashes   (c) k = 500 hashes
(d) k = 1000 hashes   (e) k = 2000 hashes   (f) k = 10000 hashes

Figure 2: Influence of the number of hashes k over the min-hash approximation of the
resemblance r. The exact Jaccard distance is calculated using the whole amount of the
available hashes for each sample.


Figure 3: Average time per sample (over 3000 samples) versus the number k of hashes
used for the min-hash approximation.

in the following, k = 2000 hashes is used for the min-hash approximation
of the Jaccard distance, as a good compromise between computational time
and approximation error.
4. Methodology using two-stage classifiers
This section details the use of a two-stage decision strategy so as to avoid
False Positives while retaining high coverage. The first-stage decision uses a
1-NN, which still yields too high a False Positive rate; this rate is lowered by
using an optimized Extreme Learning Machine model, specialized either for
False Positive or False Negative minimization.
4.1. First Stage Decision using 1-NN
4.1.1. Using K-NN with min-hash Distances
The K-Nearest Neighbor [15] method for classification is one of the most
natural to use in this setup, since it relies directly and only on distances. As
mentioned in the previous subsection, for this classifier to perform well, it
requires the proper identification of the real nearest neighbors: the approximation made using the min-hash cannot be too crude.
Using k = 2000 hashes, a reference set is devised by F-Secure Corporation which contains samples considered to be representative of
most current malware and clean samples. This set contains about 10000 samples (for each of which the k = 2000 minimum hashes have been extracted

Figure 4: 1-NN-ELM: Two-stage methodology using first a 1-NN and then specialized
ELM models (ELM FP, ELM FN) to lower false positives and false negatives. Sandbox
data for an unknown sample is compared to the clean and malware reference samples by
nearest-neighbor search with the Jaccard distance (first-stage decision), and the specialized ELM models produce the second-stage decision. The first stage uses only the
class information C1NN of the nearest neighbor, while the second stage uses additional
neighbor information: the distance d1NN to the nearest neighbor, the distance dNN to
the nearest neighbor of the opposite class and the rank RNN (i.e. which neighbor it is)
of this opposite-class neighbor.

per feature number), balanced equally between clean and malware samples.
The determination of this reference set is especially important, as it should
not contain samples for which there is uncertainty about the class:
only samples with the highest probability of being either malware or clean
are present in the reference set.
Once this reference set is fixed, samples can be compared against it using
the min-hash based distances and a K-NN classifier.
Determining K for this problem is done using a validation set for which
the certainty of the class of each sample is also very high. The validation
set contains 3000 samples, checked against the reference set of 10000 samples.
Figure 5 depicts the classification accuracy (average of True Positive and True
Negative rates) versus the value of K used for the K-NN. Surprisingly, the
decision based on the very first nearest neighbor is always the best in terms
of classification accuracy. Therefore, in the methodology presented
in Section 4, a 1-NN is used as the first-stage classifier.
4.1.2. 1-NN is not sufficient
As mentioned earlier, one of the main imperatives in this paper is to
achieve 0 False Positives (in absolute numbers). As Table 2 depicts, using
a test set (totally separate from the validation sets used above) composed of
28510 samples for which the class is known with the highest confidence,

Figure 5: Classification accuracy versus the number of nearest neighbors K used for the
K-NN. K = 1 is the best for this specific data regarding classification accuracy.

                           Actual
                      Malware   Clean
  Prediction  Malware   18160     183
              Clean       277    9890

Table 2: Confusion matrix for the sole 1-NN on the test set. If only the first stage of the
methodology is used, results are unacceptable in terms of False Positive rates.

the 1-NN approach still yields large amounts of False Positives. Note that
this test set is unbalanced, although not significantly.
The results of the 1-NN are not satisfactory regarding the constraint on
the False Positives. An obvious way of directly addressing the amount of
False Positives is to set a maximum threshold on the distance to the first
nearest neighbor: above this threshold, the sample is deemed too far from
its nearest neighbor, and no decision is taken.
While this strategy would effectively reduce the number of False Positives,
it also significantly lowers the number of True Positives, i.e. the coverage.
For this reason, and to keep a high coverage, the following methodology using
a second-stage classifier, the ELM, is proposed.
As can be seen from Figure 3, the computational time required to calculate the distances from a test sample to the whole set of 10000 reference samples
is about 35 seconds on average, using k = 2000 hashes. This is still acceptable from the practical point of view, but adding a second-stage classifier

Figure 6: Illustration of different situations with identical 1-NN: in (a), Case 1, the density
of reference samples of the same class around the test sample gives the decision high
confidence; in (b), Case 2, while the 1-NN is of the same class as in (a), the confidence in
the decision should be very different.

has the obvious drawback of increasing this time.
In order to make this increase as small as possible, an Extreme Learning
Machine model specialized for False Positives (and another for False Negatives) is used. Figure 4 illustrates the global idea of this two-stage methodology.
The motivation for an additional classifier comes from the fact that the
single piece of information from the 1-NN is not sufficient: the distance to that first
neighbor is important as well, and so are the distance and the rank of the
nearest neighbor of the opposite class. Figure 6 illustrates two
different situations in which a test sample has its first nearest neighbor in
the same class (note that the position of the samples has no meaning here,
due to the nominal nature of the data; the distances are the interesting fact).
In the first case (a), the confidence in the decision must be high, as many of
the neighbors of the test sample are near and of the same class. Case (b)
is very different and needs to have a much lower confidence in the decision
taken, if any.
A means of describing such situations is to account for:
1. The distance to the nearest neighbor, d1NN: if the nearest neighbor is
far, it is likely that the test sample is in a part of the original space
where the density of reference samples is insufficient;
2. The distance to the nearest neighbor of the opposite class, dNN: if
d1NN is very similar to dNN, the test sample lies in a part of the
space where reference samples of both classes are present at similar
distances;
3. The rank of this neighbor of the opposite class, RNN (is it the 3rd or the
100th neighbor?): this information gives a rough sense of the density,
around the test sample, of the reference samples of the same class as
that of the nearest neighbor.
The combination of these three additional pieces of information roughly
describes the situation in which the test sample lies. This is the information
fed to the second-stage classifier for the final decision.
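As a sketch, these three descriptors can be read directly off the sorted neighbor list produced by the first-stage search, so no extra distance computations are needed. The function and the (distance, label) tuple layout below are illustrative, not the paper's implementation:

```python
def second_stage_features(neighbors):
    """Given a test sample's neighbors as (distance, label) pairs sorted by
    increasing distance, extract the three descriptors fed to the second-stage
    ELM: d1NN (distance to the nearest neighbor), dNN (distance to the nearest
    neighbor of the opposite class) and RNN (the rank of that neighbor)."""
    d1nn, c1nn = neighbors[0]
    for rank, (dist, label) in enumerate(neighbors, start=1):
        if label != c1nn:
            return d1nn, dist, rank
    # No opposite-class neighbor in the list: maximally confident situation.
    return d1nn, float("inf"), len(neighbors) + 1

# Example: nearest neighbor is malware at distance 0.10; the first clean
# neighbor is only the 4th closest, at distance 0.35.
nbrs = [(0.10, "malware"), (0.12, "malware"), (0.20, "malware"), (0.35, "clean")]
print(second_stage_features(nbrs))  # (0.1, 0.35, 4)
```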
4.2. Second Stage Decision using modified ELM
4.2.1. Original ELM
The Extreme Learning Machine (ELM) algorithm was originally proposed
by Guang-Bin Huang et al. in [16, 17, 18, 19] and uses the Single Layer
Feedforward Neural Network (SLFN) structure. The main concept behind
the ELM lies in the random initialization of the SLFN weights and biases.
Then, under certain conditions, the synaptic input weights and biases do not
need to be adjusted (classically done through iterative updates such as backpropagation): the hidden layer output matrix can be computed directly, and
hence the output weights solved for. The complete network structure
(weights and biases) is thus obtained in very few steps and at very low computational cost (compared, e.g., to iterative methods for determining the weights).
Consider a set of M distinct samples (x_i, y_i) with x_i ∈ R^{d_1} and y_i ∈ R^{d_2};
then, a SLFN with N hidden neurons is modeled as the following sum:

    Σ_{i=1}^{N} β_i φ(w_i · x_j + b_i),   j ∈ {1, ..., M},        (11)

with φ the activation function, w_i the input weights, b_i the biases and
β_i the output weights.
In the case where the SLFN perfectly approximates the data, the
errors between the estimated outputs ŷ_i and the actual outputs y_i are zero,
and the relation between inputs, weights and outputs is then

    Σ_{i=1}^{N} β_i φ(w_i · x_j + b_i) = y_j,   j ∈ {1, ..., M},  (12)

which writes compactly as H β = Y, with

    H = | φ(w_1 · x_1 + b_1)   ...   φ(w_N · x_1 + b_N) |
        |        ...           ...          ...          |        (13)
        | φ(w_1 · x_M + b_1)   ...   φ(w_N · x_M + b_N) |

β = (β_1^T ... β_N^T)^T and Y = (y_1^T ... y_M^T)^T.
Solving the output weights β from the hidden layer output matrix H
and target values is achieved through the use of the Moore-Penrose generalized
inverse of the matrix H, denoted H† [20].
Theoretical proofs and a more thorough presentation of the ELM algorithm are detailed in the original paper [16]. In Huang et al.'s later work, it
has been proved that the ELM is able to perform universal function approximation [19].

4.2.2. False Positive/Negative Optimized ELM
As depicted in Figure 6 and mentioned above, the class of the nearest
neighbor alone is not sufficient information to obtain 0 False Positives.
The proposed second-stage classifier uses modified ELM models to lower
the amount of False Positives (one of the two modified ELM models
reduces False Negatives as well; only the False Positive minimizing one is
mentioned in the following).
The modified ELM model used in the second stage of the methodology
is specially optimized so as to minimize the False Positives (a similar model
to minimize the False Negatives is used as well, in the same fashion). It uses
additional information gathered while searching for the nearest neighbor (so
no additional computational time is required to obtain the training data): the
distance to the nearest neighbor d1NN, the distance to the nearest neighbor
of the opposite class dNN, and the rank of this neighbor of the opposite class
RNN. With this input data, the False Positive Optimized ELM is trained
using a weighted classification accuracy criterion.
While for binary classification problems the classification rate Acc, defined
as the average of the True Positive Rate TPR and True Negative Rate TNR,

    Acc = (TNR + TPR) / 2,                                        (14)

is typically used as a performance measure, the proposed modified ELM uses
the following weighted accuracy Acc(λ):

    Acc(λ) = (λ TNR + TPR) / (1 + λ).                             (15)

By changing the weight λ, it becomes possible to give precedence to the True
Negative Rate and thus to avoid false positives. The output of the proposed
False Positive Optimized ELM is calculated using Leave-One-Out (LOO)
PRESS (PREdiction Sum of Squares) statistics, which provides a direct and
exact formula for the calculation of the LOO error ε_PRESS for linear models.
See [21] and [22] for details of this formula and its implementations:

    ε_PRESS,i = (y_i − h_i β) / (1 − h_i P h_i^T),                (16)

where P is defined as P = (H^T H)^{-1}, H is the hidden layer output matrix of
the ELM, h_i is its i-th row and β are the output weights of the ELM.
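The PRESS formula can be checked numerically: the closed-form LOO residuals match a brute-force leave-one-out refit exactly. The sketch below uses a plain linear model on random data rather than the full ELM pipeline, purely to illustrate the identity:

```python
import numpy as np

def press_residuals(H, y):
    """Exact leave-one-out residuals of the linear model y ≈ H beta, via the
    PRESS formula: e_i = (y_i - h_i beta) / (1 - h_i P h_i^T), P = (H^T H)^-1."""
    P = np.linalg.inv(H.T @ H)
    beta = P @ H.T @ y
    leverage = np.einsum("ij,jk,ik->i", H, P, H)  # h_i P h_i^T for each row i
    return (y - H @ beta) / (1.0 - leverage)

# Compare against a brute-force leave-one-out refit on a small random problem.
rng = np.random.default_rng(0)
H = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
e = press_residuals(H, y)
i = 0
beta_loo = np.linalg.lstsq(np.delete(H, i, 0), np.delete(y, i), rcond=None)[0]
assert np.isclose(e[i], y[i] - H[i] @ beta_loo)
```

This is why the criterion is cheap: one matrix inversion yields all M LOO errors at once, instead of M separate refits.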
In order to obtain a parsimonious model in the shortest possible time,
the proposed modified ELM uses the idea of the TROP-ELM [23] and OP-ELM
[24, 25, 26, 27, 28] of pruning out neurons from an initially large ELM
model [29]. In addition, for computational time considerations, the maximum
number M of neurons desired for the final model is taken as
a parameter. Overall, the False Positive Optimized ELM used in this paper
follows the steps of Algorithm 1.
Algorithm 1 False Positive Optimized ELM.
Given a training set (x_i, y_i), x_i ∈ R^3, y_i ∈ {−1, 1}, an activation function
φ: R → R, a large number of hidden nodes N and the maximum number
M ≤ N of neurons to retain for the final model:
- Randomly assign input weights w_i and biases b_i, i ∈ {1, ..., N};
- Calculate the hidden layer output matrix H as in Equation 13;
for i = 1 to M do
  - Perform Forward Selection of the i best neurons (among N) using the
    PRESS LOO output with the Acc(λ) criterion, and ELM determination of
    the output weights β_i;
end for
- Retain the best combination out of the M different selections as the final
model structure.
The selection of the optimal λ is done experimentally, following the two
constraints of 0 False Positives and the highest possible coverage (i.e. as many

Figure 7: ROC curve (True Positive Rate versus False Positive Rate) for varying values
of λ.

True Positives as possible). Figure 7 shows the Receiver Operating Characteristic
curve for various values of λ, plotted for a balanced validation set of
3000 samples. As can be seen, the requirement of absolutely 0 False Positives
has a strong influence on the coverage (represented here by the True Positive
rate). If one allows as few as 0.06% False Positives, the coverage already
reaches 92%.
Figure 8 plots the False Positive rate against the value of λ, using the
same validation data as Figure 7. The value of λ for which the 0 False
Positives requirement is met while keeping the highest possible coverage is
λ = 30, from Figure 8.
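This experimental selection can be sketched as a simple scan: train and evaluate the model for increasing λ and keep the smallest value that achieves zero False Positives on the validation set. The helpers below are purely illustrative; the toy predictor stands in for the trained, λ-weighted ELM:

```python
def false_positive_rate(y_true, y_pred):
    """FP rate on ±1 labels, where +1 = malware and -1 = clean."""
    clean_preds = [p for t, p in zip(y_true, y_pred) if t == -1]
    return sum(p == 1 for p in clean_preds) / len(clean_preds)

def smallest_zero_fp_lambda(train_eval, y_val, lambdas):
    """Return the smallest lambda whose model yields 0 false positives on the
    validation set; train_eval(lam) must return the model's predictions."""
    for lam in sorted(lambdas):
        if false_positive_rate(y_val, train_eval(lam)) == 0.0:
            return lam
    return None  # no tested lambda reached 0 false positives

# Toy stand-in: pretend the model stops flagging the one hard clean sample
# once the weight reaches lam >= 30, mimicking the behaviour in Figure 8.
y_val = [1, 1, 1, -1, -1, -1]
def toy_eval(lam):
    return [1, 1, 1, 1 if lam < 30 else -1, -1, -1]

print(smallest_zero_fp_lambda(toy_eval, y_val, [1, 5, 10, 20, 30, 50]))  # 30
```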
4.3. Final Results on Test Data
With the parameters of the two-stage methodology determined as above,
i.e.:
- k = 2000 hashes used for the min-hash approximation of the Jaccard
distance;
- K = 1 for the K-NN first-stage classifier;
- λ = 30 for the False Positive Optimized ELM second-stage classifier,

Figure 8: Evolution of the False Positive Rate as a function of the weight λ. The first λ
attaining a 0 False Positive Rate is λ = 30.

the presented methodology is applied to a test set of 28510 samples spanning
from early 2008 until late 2011. The reference set of 10000 samples mentioned
before lies within the same time frame; the test set is balanced between
malware and clean so as to reflect the real proportions, i.e. those of the
samples received by F-Secure Corporation: roughly 2/3 malware and 1/3
clean.
Table 3 gives the previous results of the sole 1-NN, to be compared against
those of the 1-NN and False Positive Optimized ELM methodology.
It can be seen that the False Positive rate achieved in test is in line with
the results from the Leave-One-Out in (a).
The results depicted in Table 3 (c) use not only a False Positive Optimized ELM but also a False Negative Optimized ELM to reduce the False
Negatives, as mentioned in Figure 4. The improvements in the reduction of
the False Positives and the coverage achieved are satisfying for this test set.
A value of 2 False Positives on this test set is probably acceptable in
practice. If the strict goal of 0 False Positives in test is to be enforced,
then one possibility is to increase the parameter λ to a higher, more
conservative value. This has the effect of further lowering the coverage, though.
                           Actual
                      Malware   Clean
              Malware    1930       1
  Prediction  Clean         1     908
              Unknown    2473    1623

(a) Confusion matrix for the two-stage classifier methodology on the
training data (Leave-One-Out results).

                           Actual
                      Malware   Clean
  Prediction  Malware   18160     183
              Clean       277    9890

(b) Confusion matrix for the sole 1-NN on the test set.

                           Actual
                      Malware   Clean
              Malware    8393       2
  Prediction  Clean         7    4115
              Unknown   10037    5956

(c) Confusion matrix for the two-stage classifier methodology on the test
set.

Table 3: Confusion matrices for (a) the training data (Leave-One-Out results) when training the False Positive/Negative Optimized ELMs; on the whole test set, (b) using only
the 1-NN approach and (c) using the proposed 1-NN and ELM two-stage methodology.
The reduction in coverage from the second-stage ELM is noticeable, as False Positives and
Negatives are decreased significantly.

Note on hardware and computational time considerations. While the details
of the implementation are not mentioned in this paper, the proposed methodology uses a set of three computers, each equipped with 8 GB of RAM and
Intel Core 2 Quad CPUs. Apache Cassandra is the distributed database
framework used for performing efficient min-hash computations in batches,
and a memory-held queueing system (based on memcached) holds jobs
for execution against the Cassandra database. All additional computations are
performed using Python code on one of the three computers mentioned.
With this setup, as seen in Figure 3, the average per-sample evaluation
time (i.e. calculating pairwise distances to the 10000 reference samples and
finding the closest elements) is about 35 seconds. The choice of Cassandra
as a database backend ensures that the computational time grows only
linearly if the precision of the min-hash or the number of reference samples
is increased: growing the number of reference samples or
the number k of hashes used for the min-hash approximation only requires a
linear growth in the number of Cassandra nodes for the computational time
to remain identical.
5. Conclusions
This paper proposes a practically oriented methodology for a binary
classification problem in the domain of Anomaly Detection. The practical
problem at hand lies in the classification of files (samples) as either malware
or clean, based on specific sets of nominal attributes, thus requiring purely
distance-based Machine Learning techniques. The practical requirements for
this binary classification problem are somewhat unusual, as no False Positives
can be tolerated, while as many files as possible should be classified in the
minimum computational time. False Negatives are not as important in
this context.
In order to perform file-to-file comparisons, a distance measure known
as the Jaccard distance is adapted to this problem setup, and a fast approximation of it, the min-hash approximation, is proposed. The min-hash
approach makes it possible to estimate the Jaccard distance using a restricted
amount of the whole sets of attributes of each file, thus lowering the
computational time significantly. This approximation is shown experimentally
to converge to the true Jaccard distance, given enough hashes.
A two-stage decision process using two different types of classifiers provides
a fast decision while keeping the False Positive rate low: a 1-NN
model using the estimated Jaccard distance provides an initial decision on
the test sample at hand. Following in the second stage is a False Positive
Optimized ELM (a False Negative Optimized ELM is used as well, to reduce
False Negatives), which reduces the False Positives drastically,
from 183 to 2 in test, at the cost of a lower coverage. Another advantage of
the ELM-based second classifier is its very low computational time, allowing
this second-stage decision to be made for almost no additional time.
Overall, the methodology proves to be efficient for this specific problem
and has the advantage of having only two parameters that require tuning:
the number of hashes used for the min-hash approximation (the more hashes
used, the closer the approximation to the real Jaccard distance value), and the
coefficient λ weighting the False Positives in the modified ELM criterion (the
value of this coefficient directly controls the tradeoff between False Positive
rate and coverage).
The parameters devised experimentally for the specific reference set make
it possible to reach only 2 False Positives in test, with a coverage of 44% on
the malware files. This methodology is currently being tested at F-Secure
Corporation on different data sets (reference and test) for further validation.
References
[1] S. Lele, J. T. Richtsmeier, Euclidean distance matrix analysis: a
coordinate-free approach for comparing biological shapes using landmark data, American Journal of Physical Anthropology 86 (3) (1991)
415–427.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig, Syntactic
clustering of the Web, Computer Networks and ISDN Systems 29 (8–13)
(1997) 1157–1166.
[3] A. Z. Broder, On the resemblance and containment of documents, in:
Compression and Complexity of SEQUENCES 1997, IEEE Comput.
Soc., 1997, pp. 21–29.
[4] Y. Robiah, S. S. Rahayu, M. M. Zaki, S. Shahrin, M. A. Faizal, R. Marliza, A new generic taxonomy on hybrid malware detection technique,
arXiv.org cs.CR.
[5] A. Srivastava, J. Giffin, Automatic discovery of parasitic malware, in:
S. Jha, R. Sommer, C. Kreibich (Eds.), Recent Advances in Intrusion
Detection (RAID'10), Springer Berlin / Heidelberg, 2010, pp. 97–117.
[6] M. Bailey, J. Andersen, Z. Morley Mao, F. Jahanian, Automated classification and analysis of internet malware, in: Recent Advances in Intrusion Detection (RAID'07), 2007.
[7] F-Secure Corporation, F-Secure DeepGuard: a proactive response to
the evolving threat scenario (Nov. 2006).
[8] C. Willems, T. Holz, F. Freiling, Toward automated dynamic malware
analysis using CWSandbox, IEEE Security and Privacy 5 (2007) 32–39.
[9] K. Yoshioka, Y. Hosobuchi, T. Orii, T. Matsumoto, Vulnerability in
public malware sandbox analysis systems, in: Proceedings of the 2010
10th IEEE/IPSJ International Symposium on Applications and the Internet, IEEE Computer Society, Washington, DC, USA, 2010, pp. 265–268.
[10] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bulletin de la Société Vaudoise des Sciences
Naturelles 37 (1901) 547–579.
[11] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st
Edition, Addison Wesley, 2005.
[12] Python, Python algorithms complexity, http://wiki.python.org/moin/TimeComplexity#set (December 2010).
[13] J. L. Carter, M. N. Wegman, Universal classes of hash functions, Journal of Computer and System Sciences 18 (2) (1979) 143–154.
[14] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, Min-wise
independent permutations, Journal of Computer and System Sciences
60 (1998) 327–336.
[15] T. M. Cover, P. E. Hart, Nearest neighbor pattern classification, IEEE
Transactions on Information Theory 13 (1) (1967) 21–27.
[16] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[17] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine
for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (2) (2012) 513–529.
[18] G.-B. Huang, Q.-Y. Zhu, K. Z. Mao, C.-K. Siew, P. Saratchandran,
N. Sundararajan, Can threshold networks be trained directly?, IEEE
Transactions on Circuits and Systems II: Express Briefs 53 (3) (2006)
187–191.
[19] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes,
IEEE Transactions on Neural Networks 17 (4) (2006) 879–892.
[20] C. R. Rao, S. K. Mitra, Generalized Inverse of Matrices and Its Applications, John Wiley & Sons, 1971.
[21] R. Myers, Classical and Modern Regression with Applications, 2nd Edition, Duxbury, Pacific Grove, CA, USA, 1990.
[22] G. Bontempi, M. Birattari, H. Bersini, Recursive lazy learning for modeling and control, in: European Conference on Machine Learning, 1998,
pp. 292–303.
[23] Y. Miche, M. van Heeswijk, P. Bas, O. Simula, A. Lendasse, TROP-ELM:
a double-regularized ELM using LARS and Tikhonov regularization, Neurocomputing 74 (16) (2011) 2413–2421. doi:10.1016/j.neucom.2010.12.042.
[24] E. Group, The OP-ELM toolbox, available online at http://www.cis.hut.fi/projects/eiml/research/downloads/op-elm-toolbox (2009).
[25] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM:
optimally-pruned extreme learning machine, IEEE Transactions
on Neural Networks 21 (1) (2010) 158–162. doi:10.1109/TNN.2009.2036259.
[26] Y. Miche, P. Bas, C. Jutten, O. Simula, A. Lendasse, A methodology for
building regression models using extreme learning machine: OP-ELM,
in: M. Verleysen (Ed.), ESANN 2008, European Symposium on Artificial Neural Networks, Bruges, Belgium, d-side publ. (Evere, Belgium),
2008, pp. 247–252.
[27] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, GPU-accelerated and
parallelized ELM ensembles for large-scale regression, Neurocomputing
74 (16) (2011) 2430–2437. doi:10.1016/j.neucom.2010.11.034.
[28] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, Solving large regression
problems using an ensemble of GPU-accelerated ELMs, in: M. Verleysen (Ed.), ESANN 2010: 18th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, d-side
Publications, Bruges, Belgium, 2010, pp. 309–314.
[29] Y. Lan, Y. C. Soh, G.-B. Huang, Constructive hidden nodes selection
of extreme learning machine for regression, Neurocomputing 73 (16–18)
(2010) 3191–3199.