Machine Learning Methods For Data Security
Supervisor:
Prof. Pekka Orponen
Instructor:
Doc. Amaury Lendasse
2012
I. Introduction
Malware detection has been the subject of a large number of studies (see [1], [2], [3] and [4], [5], [6], [7], [8]); for example, the work of Bailey [9] using a signature-based malware detection approach has shown that recent malware types require additional information in order to obtain a good detection.
In this paper, an approach based on the extraction of dynamic features during sandbox execution is used, as suggested in [7]. In order to measure similarities between executable files, the Jaccard Index is computed over their sets of hash values (encoding the dynamic feature values obtained from the sandbox). The hash values are transformed into a large number of binary values from which the Jaccard Index can be computed (see [10] for the original work in French or [11] in English). Unfortunately, the dimensionality of such a variable space does not allow the use of traditional classifiers in a reasonable computational time.
A two-stage methodology is proposed to circumvent this dimensionality problem. In the first stage, a random projection reduces the dimensionality of the variable space and simultaneously reduces the computational time by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with the VirusTotal [12] labeling of the file samples. This two-stage methodology is presented in section III.
(Figure 1: histogram of the number of samples versus the number m_i of detecting engines, for the whole sample set.)
Figure 2. The m_i distribution for the samples used for this histogram is identical to Figure 1, with the important difference that samples such that 0 < m_i < 11 are discarded. Here 2441 samples are depicted that can be considered as clean (m_i = 0), and 18612 samples that can be considered as malicious (m_i > 10).
Figure 3. Global schematic of the methodology: a sample is run through the sandbox to obtain a set of dynamic features; the random projection approach then reduces the dimensionality of the problem while retaining most of the information conveyed by the original features; finally, a K-Nearest Neighbors classifier in the random projection space predicts whether the studied sample is malware or not.
Each sample i is characterized by its set of feature values A_i = {a_1, a_2, ..., a_D}. The Jaccard index between two samples i and i' is

J_Jaccard^{i,i'} = |A_i ∩ A_{i'}| / |A_i ∪ A_{i'}|.    (1)

Each set A_i is also encoded as a binary vector B_i, with one coordinate per possible feature value, equal to 1 if the value belongs to A_i and 0 otherwise, so that |A_i ∩ A_{i'}| = ⟨B_i, B_{i'}⟩ and |A_i| = ‖B_i‖². Here ‖·‖ denotes vector norm and ⟨·,·⟩ denotes scalar product. Since B_i is a binary vector (with coordinates 0, 1 only), ‖B_i‖² is the number of the coordinates in B_i that are equal to 1. The cosine similarity between the two samples is

J_cosine^{i,i'} = ⟨B_i, B_{i'}⟩ / (‖B_i‖ ‖B_{i'}‖).    (5)
In this section, a standard method (K-NN, see for example [17], [18], [19], [20]) is described; it can be used to predict whether an unknown executable is malicious or benign. The essential assumption of the method is that malicious (resp. clean) executables are surrounded by malicious (resp. clean) executables in the D-dimensional Euclidean space spanned by the normalized vectors

B_i / ‖B_i‖,    (8)
with Bi the binary vectors defined in the previous section. This means that the more hashes two samples have
in common the closer they are in this space (assuming
that the number of hashes in the two samples does not
change).
Let us denote the set of k nearest neighbors of sample i by N_k^i. The classification is based on the data provided by VirusTotal, that is, how many anti-virus engines have considered a given executable as malicious. Let us denote this number by m_i for sample i. The Results section examines how well the m_i of the neighboring samples N_k^i can actually predict whether the sample i in question is malicious or clean.
It is important to mention that, to predict whether a sample i is malicious or not, only neighboring samples are used and not the sample itself. This corresponds to a Leave-One-Out [21], [22], [23], [24] (LOO) classification rate when it comes to assessing the accuracy of the K-NN classifier in the Results section. In [21], [22], it is shown that the Leave-One-Out estimate approximates the generalization performance of a classifier well if the number of samples is large enough, which is the case in these experiments. As the dimensionality of B_i is too large, random projections are used in order to reduce this dimensionality and therefore reduce the needed computational time and memory by several orders of magnitude. Random projections are explained in the following section.
C. Random Projections
The cosine similarity defined above can be approximated as follows. For each sample i and each hash value m ∈ A_i, a d-dimensional random vector

X^i_m = [X^i_{m,1}, X^i_{m,2}, ..., X^i_{m,d}],   X^i_m ~ N(0, I),

is generated, such that X^i_m = X^{i'}_{m'} if m = m', while X^i_m ≠ X^{i'}_{m'} if m ≠ m' (the same hash value is always mapped to the same random vector, whatever the sample). N(0, I) represents a d-dimensional standard normal distribution for which the covariance matrix is the identity matrix, I.

Then, for each file i the corresponding random projection is the d-dimensional random vector Y_i defined as

Y_i = (1 / √(d |A_i|)) Σ_{m ∈ A_i} X^i_m.    (11)

The scalar product of the random vectors gives the similarity J, which is a scalar valued random variable. Using

J^{i,i'} = ⟨Y_i, Y_{i'}⟩,    (12)

together with

E(⟨X^i_m, X^{i'}_{m'}⟩) = 0  for all m ≠ m',    (13)

and the fact that ⟨X^i_m, X^i_m⟩ follows a χ²(d) distribution, where χ²(d) denotes the chi-square distribution with d degrees of freedom, whose expectation value is d, one obtains

E(J^{i,i'}) = |A_i ∩ A_{i'}| / √(|A_i| |A_{i'}|),    (15)

which agrees with the cosine similarity. Note that in general if |A_i| ≠ |A_{i'}| then E(J^{i,i'}) ≠ J_Jaccard^{i,i'} but still E(J^{i,i'}) = J_cosine^{i,i'}. Therefore, the Jaccard index is approximated using the cosine similarity approach defined previously.
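A minimal sketch of this construction is given below (illustrative only; the way each X_m is generated reproducibly from the hash value m, here by seeding a generator with a CRC of m, is an assumption of this example rather than a detail given in the text).

```python
import zlib
import numpy as np

def project(feature_set, d=4000):
    """Y_i = (1 / sqrt(d * |A_i|)) * sum over m in A_i of X_m,
       with X_m ~ N(0, I_d) drawn reproducibly per hash value m."""
    Y = np.zeros(d)
    for m in feature_set:
        rng = np.random.default_rng(zlib.crc32(str(m).encode()))  # same m -> same X_m
        Y += rng.standard_normal(d)
    return Y / np.sqrt(d * len(feature_set))

# E[<Y_i, Y_j>] approximates |A_i ∩ A_j| / sqrt(|A_i| |A_j|), the cosine similarity
Yi = project({"h1", "h2", "h3"})
Yj = project({"h2", "h3", "h4", "h5"})
print(Yi @ Yj)   # close to 2 / sqrt(12) ≈ 0.577 for large d
```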
IV. Results
In this section, the Euclidean distance is used in the d-dimensional space spanned by the random projected representations Y_i of the samples. As noted earlier, the use of the Euclidean distance instead of the cosine similarity does not change the results presented in this section: the Y_i are normalized to unity, so that the Euclidean distance between two samples is a monotonically decreasing function of their scalar product J^{i,i'}.
(Figure 4: histograms of the number of samples versus the K-NN prediction, shown separately for samples with number of detecting engines = 0 (true negatives and false positives) and for samples with number of detecting engines > 10 (true positives and false negatives).)
An illustration of the prediction accuracy of the K-NN method (see section III-B) is shown in Figure 4, and described in detail in the following.

Let N_10^i be the set of the 10 nearest samples to sample i; then the prediction of the K-NN method for m_i is the mean m̄_i of the values {m_{i'} : i' ∈ N_10^i}, expressed as

m̄_i = (1 / |N_10^i|) Σ_{i' ∈ N_10^i} m_{i'}.    (16)
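A compact sketch of this prediction rule (illustrative; Y is assumed to be a NumPy array whose rows are the normalized projected vectors, and m a NumPy array of the VirusTotal detection counts):

```python
import numpy as np

def mean_neighbor_detections(Y, m, k=10):
    """For each sample i, return the mean detection count of its k nearest
       neighbors in Euclidean distance, excluding the sample itself (LOO)."""
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                                 # never use the sample itself
    neighbors = np.argsort(D, axis=1)[:, :k]
    return m[neighbors].mean(axis=1)                            # the predictions m_bar
```

A sample can then be called clean when the predicted mean is close to 0 and malicious when it is large, which is what Figures 4 and 5 evaluate.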
(Figure 5: amounts of true negatives, false negatives, true positives, false positives and unpredictable samples versus the number k of nearest neighbors.)
Acknowledgments
The authors of this paper would like to acknowledge FSecure Corporation for providing the data and software
required to perform this research. Special thanks go to
Pekka Orponen (Head of the ICS Department, Aalto
University), Alexey Kirichenko (Research Collaboration
Manager F-Secure) and Daavid Hentunen (Researcher
F-Secure) for their valuable support and many useful
comments. This work was supported by TEKES as part
of the Future Internet Programme of TIVIT. Part of the
work of Amaury Lendasse and Alexander Ilin is funded
by the Adaptive Informatics Research Centre, Centre of
Excellence of the Finnish Academy.
References
[1] Y. Liu, L. Zhang, J. Liang, S. Qu, and Z. Ni, "Detecting trojan horses based on system behavior using machine learning method," in Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, vol. 2, July 2010, pp. 855–860.
[2] I. Firdausi, C. Lim, A. Erwin, and A. Nugroho, "Analysis of machine learning techniques used in behavior-based malware detection," in Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on, December 2010, pp. 201–203.
[3] E. Menahem, A. Shabtai, L. Rokach, and Y. Elovici, "Improving malware detection by applying multi-inducer ensemble," Computational Statistics & Data Analysis, vol. 53, no. 4, pp. 1483–1494, 2009.
[4] L. Sun, S. Versteeg, S. Boztaş, and T. Yann, "Pattern recognition techniques for the classification of malware packers," in Information Security and Privacy, ser. Lecture Notes in Computer Science, R. Steinfeld and P. Hawkes, Eds. Springer Berlin / Heidelberg, 2010, vol. 6168, pp. 370–390.
[5] J. Kinable and O. Kostakis, "Malware classification based on call graph clustering," Journal in Computer Virology, pp. 1–13, 2011.
[6] A. Srivastava and J. Giffin, "Automatic discovery of parasitic malware," in Recent Advances in Intrusion Detection (RAID'10), ser. Lecture Notes in Computer Science, S. Jha, R. Sommer, and C. Kreibich, Eds. Springer Berlin / Heidelberg, 2010, vol. 6307, pp. 97–117.
[7] C. Willems, T. Holz, and F. Freiling, "Toward automated dynamic malware analysis using CWSandbox," IEEE Security and Privacy, vol. 5, pp. 32–39, March 2007.
Abstract
In this paper, a methodology for performing binary classification on nominal
data under specific constraints is proposed. The goal is to classify as many
samples as possible while avoiding False Positives at all costs, all within the
smallest possible computational time. Under such constraints, a fast way
of calculating pairwise distances between the nominal data available for all samples is proposed. A two-stage decision methodology using two types of classifiers then provides a fast means of obtaining a classification decision on a sample, keeping False Positives as low as possible while classifying as many samples as possible (high coverage). The methodology only has two parameters, which respectively set the precision of the distance approximation and the final tradeoff between False Positive rate and coverage. Experimental
results using a specific data set provided by F-Secure Corporation show that
Email address:
{yoan.miche,anton.akusok,jozsef.hegedus,amaury.lendasse}@aalto.fi,
[email protected] (Yoan Miche1 , Anton Akusok1 , Jozsef Hegedus1 , Rui Nian4
and Amaury Lendasse1,2,3 )
July 4, 2012
2. A Specific Application
The goal of Anomaly Detection in the context of computer intrusion detection [4] is to identify abnormal behavior, defined as behavior deviating from what is considered normal, and to signal the anomaly in order to take appropriate measures: identification of the anomaly source, shutdown or closing of sensitive information or software...
Most current anomaly detection systems rely on sets of heuristics or rules
to identify this abnormality. Such rules and heuristics enable some flexibility
on the detection of new anomalies, but still require action from the expert to
tune the rules according to the new situation and the potential new anomalies identified. One ideal goal is then to have a global system capable of learning what constitutes normal and abnormal behavior, and therefore able to reliably identify new anomalies [5, 6]. In such a context, the only human interaction required is the monitoring of the system, to ensure that the learning phase has happened properly.
A small part of the whole anomaly detection problem is studied in this
paper, in the form of a binary classification problem for malware and clean
samples. While the output of this problem is quite typical, the input is not.
In order to compare files together and compute a similarity between them, a
set of features is needed. F-Secure Corporation devised such a set of features
[7], based partly on sandbox execution (virtual environment for a sample
execution [8, 9]). This sandbox is capable of providing a wide variety of
behavioral information (events), which as a whole can be divided into two
main categories: hardware-specific or OS-specific. The hardware-specific information is related to the low-level, mostly CPU-specific, events occurring
during execution of the application being analyzed in the virtual environment (up to the CPU instruction flow tracing). The other category mostly
relates to the events caused by interaction of the application with the virtual OS (the sandbox). This category includes information such as General
Thread/Process events (e.g. Start/Stop/Switching), API call events, specific
events like Structured Exception Handling, system module loading, etc. Besides, the sandbox can provide (upon user request) some other information about application execution, like reaching pre-set breakpoints or detecting behavioral patterns which are not typical for traditional well-written benign applications (e.g., so-called anti-emulation and anti-debugging tricks), etc.
The sandbox features used in the following research are thus the dynamic behavioral features extracted by this sandbox.
Figure 1: Feature extraction from a file (sample): The sandbox runs the sample in a
virtual environment and extracts dynamic (run-time specific) information; meanwhile a
set of static features are extracted and both sets are combined in the whole feature set.
                      Predicted Malware       Predicted Clean
Actual Malware        True Positive (TP)      False Negative (FN)
Actual Clean          False Positive (FP)     True Negative (TN)
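For reference, the quantities that matter most in this paper (the False Positive rate, which must stay at zero, and the coverage) follow directly from such a confusion matrix; a small sketch with hypothetical counts:

```python
def rates(tp, fn, fp, tn, unknown=0):
    """Rates from a confusion matrix; samples left undecided ('Unknown')
       are assumed here to reduce only the coverage."""
    total = tp + fn + fp + tn + unknown
    return {
        "TPR": tp / (tp + fn),                     # recall on malware
        "FPR": fp / (fp + tn),                     # must be driven to (almost) zero
        "coverage": (tp + fn + fp + tn) / total,   # fraction of samples decided
    }

print(rates(tp=95, fn=5, fp=0, tn=100, unknown=40))   # toy numbers
```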
The Jaccard index between two sets A and B is

J(A, B) = |A ∩ B| / |A ∪ B|,    (1)

and, averaged over a collection C of pairs of sets (A_i, B_i), it can be written

(1/|C|) Σ_{i ∈ C} |A_i ∩ B_i| / (|A_i| + |B_i| − |A_i ∩ B_i|).    (2)

The resemblance between two samples is then approximated from the sets min_k(·) of their k smallest hash values, where the notation min_k(X) denotes the set of the k smallest elements in X (assuming X is fully ordered). While this is a crude approximation, experiments show that the convergence with respect to k towards the true value of the resemblance is assured, as shown in the following subsection.
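One common way to realize such a min-hash (bottom-k) estimate is sketched below; since the displayed formulas of this section are only partially legible in this copy, the sketch should be read as a generic illustration of the idea rather than as the paper's exact estimator.

```python
import hashlib

def bottom_k(values, k):
    """Keep the k smallest images of the sample's hash values under a fixed hash."""
    mapped = sorted(int(hashlib.md5(str(v).encode()).hexdigest(), 16) for v in values)
    return set(mapped[:k])

def resemblance(A, B, k=2000):
    """Estimate the Jaccard resemblance |A ∩ B| / |A ∪ B| from bottom-k sketches."""
    Sa, Sb = bottom_k(A, k), bottom_k(B, k)
    union_sketch = set(sorted(Sa | Sb)[:k])        # bottom-k of the union
    return len(union_sketch & Sa & Sb) / len(union_sketch)
```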
3.2.3. Influence of the number of hashes on the proposed min-hash approximation
Figure 2 illustrates experimentally the validity of the proposed approximation of the Jaccard distance by the min-hash based resemblance. These plots use a small subset of 3000 samples from the whole dataset, used only for the purpose of validating the number of hashes k required for a proper approximation.
As can be seen, with low numbers of hashes, such as k = 10 or 100 (subfigures (a) and (b)), quantization effects appear on the estimation of the resemblance, and the estimation errors are large. These quantization problems are especially important with regard to the method using these distances, K-Nearest Neighbors, as presented in the next section: since distances are so strongly quantized, samples lying at different distances appear to lie at the same distance, and can thus wrongly be taken as nearest neighbors.
The quantization effects are lessened when k reaches the hundreds of hashes, as in subfigure (c), while the errors on the estimation remain large. k = 2000 hashes reduces such errors to only the largest distances, which are of less importance for the following methodology. While k = 10000 hashes reduces these errors further (and even more so for larger values of k), the main reason for using the min-hash approximation described here is to reduce drastically the computational time.
Figure 3 is a plot of the average time required per sample for the determination of the distances to the whole reference set, with respect to the number of hashes k used for the min-hash. Thanks to the use of the Apache Cassandra backend (with three nodes) for these calculations1, the computational time only grows linearly with the number of hashes (and also linearly with the number of samples in the reference set, although this is not depicted here). Unfortunately, large values of k do not decrease the computational time sufficiently for the practical application of this methodology. Therefore,
1
Details of the implementation are not given in this paper, but can be found from
the publications and deliverables of the Finnish ICT SHOK Programme Future Internet:
http://www.futureinternet.fi
(a) k = 10 hashes
Figure 2: Influence of the number of hashes k over the min-hash approximation of the resemblance r. The exact Jaccard distance is calculated using the whole amount of the available hashes for each sample.
Figure 3: Average time per sample (over 3000 samples) versus the number k of hashes
used for the min-hash approximation.
Figure 4: 1-NN-ELM: Two stage methodology using first a 1-NN and then specialized
ELM models to lower false positives and false negatives. The first stage uses only the
class information C1NN of the nearest neighbor, while the second stage uses additional
neighbors information: the distance d1NN to the nearest neighbor, the distance dNN to
the nearest neighbor of the opposite class and the rank RNN (i.e. which neighbor is it)
of this opposite class neighbor.
per feature number), balanced equally between clean and malware samples.
The determination of this reference set is especially important as it should
not contain samples for which there are some uncertainties about the class:
Only samples with the highest probability of being either malware or clean
are present in the reference set.
Once this reference set is fixed, samples can be compared against it using
the min-hash based distances and a K-NN classifier.
Determining K for this problem is done using a validation set for which
the certainty of the class of each sample is very high as well. The validation
set contains 3000 samples, checked against the reference set of 10000 samples.
Figure 5 depicts the classification accuracy (average of True Positive and True
Negative rates) versus the value of K used for the K-NN. Surprisingly, the
decision based on the very first nearest neighbor is always the best in terms
of classification accuracy. Therefore, in the following methodology presented
in Section 4, a 1-NN is used as the first step classifier.
4.1.2. 1-NN is not sufficient
As mentioned earlier, one of the main imperatives in this paper is to
achieve 0 False Positives (in absolute numbers). As Table 2 depicts, by using
a test set (totally separate from the validation sets used above) composed of
28510 samples for which the class is known with the highest confidence, with
Figure 5: K = 1 is the best for this specific data regarding classification accuracy (classification accuracy plotted against the number of nearest neighbors K used).
                        Actual Malware    Actual Clean
Prediction: Malware          18160              183
Prediction: Clean              277             9890

Table 2: Confusion Matrix for the sole 1-NN on the test set. If only the first stage of the methodology is used, results are unacceptable in terms of False Positive rates.
the 1-NN approach still yields large amounts of False Positives. Note that
this test set is unbalanced, although not significantly.
The results of the 1-NN are not satisfactory regarding the constraint on
the False Positives. An obvious way of addressing directly the amount of
False Positives is to set a maximum threshold on the distance to the first
nearest neighbor: Above this threshold, the sample is deemed too far from
its nearest neighbor, and no decision is taken.
While this strategy would effectively reduce the number of False Positives, it also lowers significantly the number of True Positives, i.e. the coverage. For this reason, and to keep a high coverage, the following methodology using a second stage classifier such as the ELM is proposed.
As can be seen from Figure 3, the computational time required to calculate the distances from a test sample to the whole set of 10000 reference samples is about 35 seconds on average, using k = 2000 hashes. This is still acceptable from the practical point of view, but adding a second stage classifier
Figure 6: Illustration of different situations with identical 1-NN: in (a) the density of reference samples of the same class around the test sample gives the decision high confidence; in (b), while the 1-NN is of the same class as for (a), the confidence on the decision should be very different.
1. The distance d1NN to the test sample's nearest neighbor;
2. The distance dNN to the nearest neighbor of the opposite class, which indicates how ambiguous the neighborhood of the test sample is in terms of distances;
3. The rank of this neighbor of opposite class RNN (is it the 3rd or 100th neighbor?): This information gives a rough sense of the density of the reference samples of the same class as that of the nearest neighbor around the test sample.
The combination of these three additional pieces of information describes roughly the situation in which the test sample lies. This is the information fed to the second stage classifier for the final decision.
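A sketch of how these quantities could be computed for one test sample, given its distances to the reference set and the reference labels as NumPy arrays (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def second_stage_features(distances, labels):
    """distances: distances from the test sample to every reference sample.
       labels:    0 = clean, 1 = malware, for the reference samples
                  (both classes are assumed to be present).
       Returns the 1-NN class, d_1NN, the distance to the nearest neighbor of
       the opposite class, and the (1-based) rank of that opposite neighbor."""
    order = np.argsort(distances)                           # closest reference first
    c_1nn = labels[order[0]]                                # first-stage decision
    d_1nn = distances[order[0]]
    opp = np.nonzero(labels[order] != c_1nn)[0][0]          # first opposite-class neighbor
    return c_1nn, d_1nn, distances[order[opp]], opp + 1
```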
4.2. Second Stage Decision using modified ELM
4.2.1. Original ELM
The Extreme Learning Machine (ELM) algorithm was originally proposed
by Guang-Bin Huang et al. in [16, 17, 18, 19] and it uses the Single Layer
Feedforward Neural Network (SLFN) structure. The main concept behind
the ELM lies in the random initialization of the SLFN weights and biases.
Then, under certain conditions, the synaptic input weights and biases do not need to be adjusted (classically done through iterative updates such as back-propagation), and it is possible to calculate implicitly the hidden layer output matrix and hence the output weights. The complete network structure (weights and biases) is thus obtained in very few steps and at very low computational cost (compared to iterative methods for determining the weights).
Consider a set of M distinct samples (x_i, y_i) with x_i ∈ R^{d1} and y_i ∈ R^{d2}; then, a SLFN with N hidden neurons is modeled as the following sum

Σ_{i=1}^{N} β_i φ(w_i x_j + b_i),  j ∈ ⟦1, M⟧,    (11)

with φ being the activation function, w_i the input weights, b_i the biases and β_i the output weights.
In the case where the SLFN would perfectly approximate the data, the errors between the estimated outputs ŷ_i and the actual outputs y_i are zero and the relation between inputs, weights and outputs is then

Σ_{i=1}^{N} β_i φ(w_i x_j + b_i) = y_j,  j ∈ ⟦1, M⟧,    (12)

which can be written compactly as Hβ = Y, with the hidden layer output matrix

H = [ φ(w_1 x_1 + b_1)  ...  φ(w_N x_1 + b_N)
            ...         ...         ...
      φ(w_1 x_M + b_1)  ...  φ(w_N x_M + b_N) ],    (13)

β = (β_1^T ... β_N^T)^T and Y = (y_1^T ... y_M^T)^T.
Solving the output weights β from the hidden layer output matrix H and the target values is achieved through the use of a Moore-Penrose generalized inverse of the matrix H, denoted as H† [20].
Theoretical proofs and a more thorough presentation of the ELM algorithm are detailed in the original paper [16]. In Huang et al.'s later work it has been proved that the ELM is able to perform universal function approximation [19].
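For illustration, a minimal ELM along these lines can be written in a few lines of NumPy (a generic sketch of the standard algorithm with an arbitrary activation function, not the authors' exact implementation):

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, rng=None):
        self.n_hidden = n_hidden
        self.rng = rng or np.random.default_rng(0)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.n_hidden))  # random input weights w_i
        self.b = self.rng.standard_normal(self.n_hidden)       # random biases b_i
        H = np.tanh(X @ self.W + self.b)                        # hidden layer output matrix H
        self.beta = np.linalg.pinv(H) @ y                       # output weights via H^dagger
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```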
The weighted classification accuracy used is

Acc(λ) = (λ · TNR + TPR) / (1 + λ).    (15)

By changing the weight λ, it becomes possible to give precedence to the True Negative Rate and thus to avoid false positives. The output of the proposed False Positive Optimized ELM is calculated using Leave-One-Out (LOO) PRESS (PREdiction Sum of Squares) statistics, which provides a direct and exact formula for the calculation of the LOO error ε_PRESS for linear models. See [21] and [22] for details of this formula and its implementations:

ε_PRESS,i = (y_i − h_i β) / (1 − h_i P h_i^T),    (16)

where h_i denotes the i-th row of H and P = (H^T H)^{-1}.
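The PRESS residuals of equation (16) can be evaluated in closed form once H is known; a short sketch (assuming an ordinary least-squares solution, for illustration):

```python
import numpy as np

def press_residuals(H, y):
    """Leave-one-out residuals for the linear model H @ beta ~ y:
       eps_i = (y_i - h_i beta) / (1 - h_i P h_i^T), with P = (H^T H)^{-1}."""
    P = np.linalg.pinv(H.T @ H)
    beta = P @ H.T @ y
    leverage = np.einsum('ij,jk,ik->i', H, P, H)   # h_i P h_i^T for every row of H
    return (y - H @ beta) / (1.0 - leverage)
```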
Figure 7: ROC curve (True Positive Rate versus False Positive Rate) for varying values of the weight λ.
Figure 8: Evolution of the False Positive Rate as a function of the weight λ. The first attained 0 False Positive Rate is for λ = 30.
                        Actual Malware    Actual Clean
Prediction: Malware           1930               1
Prediction: Clean                1             908
Prediction: Unknown           2473            1623
The methodology uses a set of three computers, each equipped with 8GB of RAM and Intel Core2 Quad CPUs. Apache Cassandra is the distributed database framework used for performing efficient min-hash computations in batches, and a memory-held queueing system (based on memcached) holds jobs for execution against the Cassandra database. All additional computations are performed using Python code on one of the three computers mentioned.
With this setup, as seen in Figure 3, the average per-sample evaluation time, i.e. calculating pairwise distances to the 10000 reference samples and finding the closest elements, is about 35 seconds. The choice of Cassandra as a database backend is meant to ensure that the computational time grows only linearly if the precision of the min-hash or the number of reference samples is increased: growing the number of reference samples or the number k of hashes used for the min-hash approximation only requires a linear growth in the number of Cassandra nodes for the computational time to remain identical.
5. Conclusions
This paper proposes a practical case oriented methodology for a binary
classification problem in the domain of Anomaly Detection. The practical
problem at hand lies in the classification of files (samples) as either malware
or clean, based on specific sets of nominal attributes, thus requiring purely
distance-based Machine Learning techniques. The practical requirements for
this binary classification problem are somewhat unusual, as no False Positives
can be tolerated, while as many files as possible should be classified in the
minimum computational time. The False Negatives are not as important in
this context.
In order to perform file-to-file comparisons, a distance measure known as the Jaccard distance is adapted to this problem setup, and a fast approximation of it, the Min-Hash approximation, is proposed. The Min-Hash approach makes it possible to obtain an estimation of the Jaccard distance using only a restricted part of the whole set of attributes of each file, thus lowering the computational time significantly. This approximation is shown to converge experimentally to the true Jaccard distance, given enough hashes.
A two-stage decision process using two different types of classifiers makes it possible to provide a fast decision while keeping the False Positive rate low: a 1-NN model using the estimated Jaccard distance provides an initial decision on the test sample at hand. Following in the second stage is a False Positive optimized ELM.
[6] M. Bailey, J. Andersen, Z. Morley Mao, F. Jahanian, Automated classification and analysis of internet malware, in: Recent Advances in Intrusion Detection (RAID'07), 2007.
[7] F-Secure Corporation, F-Secure DeepGuard: A proactive response to the evolving threat scenario (Nov. 2006).
[8] C. Willems, T. Holz, F. Freiling, Toward Automated Dynamic Malware Analysis Using CWSandbox, IEEE Security and Privacy 5 (2007) 32–39.
[9] K. Yoshioka, Y. Hosobuchi, T. Orii, T. Matsumoto, Vulnerability in Public Malware Sandbox Analysis Systems, in: Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, IEEE Computer Society, Washington, DC, USA, 2010, pp. 265–268.
[10] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901) 547–579.
[11] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st Edition, Addison Wesley, 2005.
[12] Python, Python algorithms complexity, http://wiki.python.org/moin/TimeComplexity#set (December 2010).
[13] J. L. Carter, M. N. Wegman, Universal Classes of Hash Functions, Journal of Computer and System Sciences 18 (2) (1979) 143–154.
[14] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, Min-wise Independent Permutations, Journal of Computer and System Sciences 60 (1998) 327–336.
[15] T. M. Cover, P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21–27.
[16] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme Learning Machine: Theory and Applications, Neurocomputing 70 (2006) 489–501.
in: M. Verleysen (Ed.), ESANN 2008, European Symposium on Artificial Neural Networks, Bruges, Belgium, d-side publ. (Evere, Belgium), 2008, pp. 247–252.
[27] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, GPU-accelerated and parallelized ELM ensembles for large-scale regression, Neurocomputing 74 (16) (2011) 2430–2437. doi:10.1016/j.neucom.2010.11.034.
[28] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, Solving large regression problems using an ensemble of GPU-accelerated ELMs, in: M. Verleysen (Ed.), ESANN 2010: 18th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, d-side Publications, Bruges, Belgium, 2010, pp. 309–314.
[29] Y. Lan, Y. C. Soh, G.-B. Huang, Constructive hidden nodes selection of extreme learning machine for regression, Neurocomputing 73 (16-18) (2010) 3191–3199.