Fileprints: Identifying File Types by n-gram Analysis
Abstract – We propose a method to analyze files to categorize their type using efficient 1-gram analysis of their binary contents. Our aim is to accurately identify the true type of an arbitrary file using statistical analysis of its binary contents, without parsing. Consequently, we may determine the type of a file even if its name does not announce its true type. The method represents each file type by a compact representation we call a fileprint, effectively a simple means of representing all members of the same file type by a set of statistical 1-gram models. The method is designed to be highly efficient, so that files can be inspected with little or no buffering, whether on a network appliance operating in a high-bandwidth environment or when streaming the file from or to disk.

I. INTRODUCTION

Files typically follow naming conventions that use standard extensions describing their type or the applications used to open and process them. However, although a file may be named Paper.doc¹, it may not be a legitimate Word document file unless it is successfully opened and displayed by Microsoft Word, or parsed and checked by tools, such as the Unix file command, if such tools exist for the file type in question.

¹ For our purposes here, we refer to .DOC as Microsoft Word documents, although other applications use the .DOC extension, such as Adobe FrameMaker, Interleaf Document Format, and Palm Pilot format, to name a few.

The Unix file command performs several syntactic checks of a file to determine whether it is text or binary executable; otherwise it is deemed data, a "catch all" for just about anything. These tests include checking header information, for example for "magic numbers", to identify how the file was created. One test performed by file considers whether the bulk of byte values are printable ASCII characters, and hence such files are deemed text files.

The magic numbers serve as a "signature" to identify the file type, such as in the case of .PDF files, where the header contains "25 50 44 46 2D 31 2E". However, a pure signature-based (or string-compare) file type analyzer [1, 2] runs several risks. Not all file types have such magic numbers. The beginning of the file could be damaged, or purposely missing if obfuscation is used. For example, malicious code can be hidden by many techniques, such as the use of binders, packers or code obfuscation [3, 4]. In the case of network traffic analysis, due to packet fragmentation, the beginning portion of a file may not be entirely contained in a single packet datagram, or it may be purposely padded and intermixed with other data in the same packet to avoid signature-based detection. Finally, not all file types have a distinct "magic number".

In this paper we propose a method to analyze the contents of exemplar files using efficient statistical modeling techniques. In particular, we apply n-gram analysis to the binary content of a set of exemplar "training" files and produce normalized n-gram distributions representing all files of a specific type. Our aim is to determine the validity of files claiming to be of a certain type (even though the header may indicate a certain file type, the actual content may not be what is claimed) or to determine the type of an unnamed file object. In our prior work, we exploited this modeling technique in network packet content analysis for zero-day worm detection [5]. We extend that work here for checking file types, whether in network traffic flows or on disk. In that prior work we generated many models conditioned on port/service and length of payload, yielding a set of models that represents normal data flow very accurately and identifies different data quite accurately.

McDaniel and Heydari [6] introduced algorithms for generating "fingerprints" of file types using byte-value distributions of file content, which is very similar in spirit to the work reported here. There are, however, several important differences in our work. First, they compute a single representative fingerprint for the entire class of file types. Our work demonstrates that it is very difficult to produce one single descriptive model that accurately represents all members of a single file type class; their reported experimental results also demonstrate this. Hence, we introduce the idea of computing a set of centroid models and use clustering to find a minimal set of centroids with good performance. Furthermore, the McDaniel paper describes tests using 120 files divided among 30 types, with 4 files used to compute a model for each type. We have discovered that files within the same type may vary greatly (especially documents with embedded objects such as images), and hence so few exemplars may yield poor performance.
In prior work, a supervised training strategy was applied to model known malicious code and detect members of that class in email attachments. The n-gram distributions were used as input to a supervised Naïve Bayes machine learning algorithm to compute a single classifier of "malicious file content". In this work, we extend these techniques by calculating the entire 1-gram distributions of file content and using these as models for a set of file types of interest. The distribution of byte values of a file is compared to the models using well-known statistical distribution distance measures. Here, we restrict our attention to only two, Mahalanobis and Manhattan distance.

For concreteness, suppose we have a file, F, of unknown type. In general, to distinguish between file types A and B, we compute two models, M_A and M_B, corresponding to file types A and B, respectively. To test the single file F, we check the distribution of its content against both models, and see which one it most closely matches. We assert its name as F.B if the contents of F closely match model M_B, i.e., D(\delta(F.B), M_B) is less than some threshold, where \delta refers to the 1-gram distribution (a histogram), and D is a distance function.

Alternatively, given file F.A, we may check its contents against M_A, and if the 1-gram distribution of F.A is too distant from the model M_A, we may assert that F.A is anomalous and hence misnamed; that is, D(\delta(F.A), M_A) > T_A for some preset threshold T_A. We may then suspect that F.A is infected with foreign content and subject it to further tests, for example, to determine whether it has embedded exploit code.

In this paper, we test whether we can accurately classify file types. This test is used to corroborate our thesis that file types have a regular 1-gram representation. We apply a test to 800 normal files with 8 different extensions. We compute a set of fileprints (or "centroid" models) for each of the 8 distinct types, and test a set of files for correct classification by those models. Ground truth is known, so accurate measurements are possible.

Computing multiple centroids better captures the variability of the training file type distributions, and hence may provide a more accurate fileprint classifier with fewer false positives. We extend this strategy to the extreme case. Rather than computing cluster centroids, we consider a set of exemplar files of a certain type as the fileprint. Each test file is compared to this set of pre-assigned exemplar files. The performance results of each of these tests show remarkably good results. We also perform these same tests using different portions of the files, a strategy we call truncation.

In Section II we briefly describe n-gram analysis and give an overview of the modeling techniques used in this study. Section III details the data sets and the experimental results. Section IV concludes the paper.

II. FILEPRINTS

A. n-gram Analysis

Before demonstrating and graphically plotting the fileprints of file contents, we first introduce n-gram analysis. An n-gram [9] is a subsequence of N consecutive tokens in a stream of tokens. n-gram analysis has been applied in many tasks, and is well understood and efficient to implement.

By converting a string of data to n-grams, it can be embedded in a vector space to efficiently compare two or more streams of data. Alternatively, we may compare the distributions of n-grams contained in a set of data to determine how consistent some new data may be with the set of data in question.

An n-gram distribution is computed by sliding a fixed-size window through the set of data and counting the number of occurrences of each "gram". Figure 2 displays an example of a 3-byte window sliding right one byte at a time to generate each 3-gram. Each 3-gram is displayed in the highlighted "window".
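As an illustration of this sliding-window computation, the short sketch below is ours (not from the paper; the names are arbitrary) and counts n-grams exactly as described:

    from collections import Counter

    def ngram_counts(data: bytes, n: int = 3) -> Counter:
        """Slide an n-byte window one byte at a time and count each gram."""
        counts = Counter()
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1  # the gram in the current window
        return counts

    # The 3-grams of b"ABCDE" are b"ABC", b"BCD", b"CDE"
    print(ngram_counts(b"ABCDE", 3))

Normalizing the counts by the total number of grams yields the n-gram distribution used in the remainder of the paper.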
Since each token can take one of 256 byte values, there are 256^3 possible 3-grams. A string of M letters would thus have (M - 2) 3-grams, with a distribution that is quite sparse.

In this work, we focus initially on 1-gram analysis of ASCII byte values. Hence, a single file is represented as a 256-element histogram. This is a highly compact and efficient representation, but it may not have sufficient resolution to represent a class of file types. The results of our experiments indicate that 1-grams indeed perform well enough, without providing sufficient cause to consider higher-order grams.

B. Mahalanobis Distance

Given a training data set of files of type A, we compute a model M_A. For each specific observed file, M_A stores the average byte frequency and the standard deviation of each byte's frequency.

Note that the byte value mean frequency, \bar{x}, may be computed in real time during training as an incremental function:

\bar{x}' = \frac{\bar{x} \times N + x_{N+1}}{N+1} = \bar{x} + \frac{x_{N+1} - \bar{x}}{N+1},

and similarly for the computation of the standard deviation. Hence, the models may be computed very efficiently while streaming the data, without the need to fully buffer the file.
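A minimal sketch of this streaming model computation follows (our illustration, not the authors' code): the mean uses the incremental update above, and for the standard deviation we substitute Welford's running-variance update, one standard way to realize the "similarly":

    class Fileprint:
        """Per-byte mean and standard deviation of frequency, updated online."""

        def __init__(self):
            self.n = 0
            self.mean = [0.0] * 256  # running mean frequency per byte value
            self.m2 = [0.0] * 256    # running sum of squared deviations

        def update(self, freq):
            """Fold in one file's 256-bin frequency histogram:
            mean' = mean + (x_{N+1} - mean) / (N + 1)."""
            self.n += 1
            for i in range(256):
                delta = freq[i] - self.mean[i]
                self.mean[i] += delta / self.n                   # incremental mean
                self.m2[i] += delta * (freq[i] - self.mean[i])   # Welford variance

        def sigma(self, i):
            return (self.m2[i] / self.n) ** 0.5 if self.n else 0.0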
Once a model has been computed, we next consider the comparison of a test file against this model, either to validate the file's purported type, or to assign a type to a file of unknown origin. We use the Mahalanobis Distance for this purpose. The Mahalanobis Distance weights each variable, the mean frequency of a 1-gram, by its standard deviation and covariance. The computed distance value is a statistical measure of how well the distribution of the test example matches (or is consistent with) the training samples, i.e., the normal content modeled by the centroids. If we assume each byte value is statistically independent, we can simplify the Mahalanobis Distance as:

D(x, \bar{y}) = \sum_{i=0}^{n-1} \frac{|x_i - \bar{y}_i|}{\sigma_i + \alpha},

where x is the feature vector of the new observation, \bar{y} is the averaged feature vector computed from the training examples, \sigma_i is the standard deviation, and \alpha is a smoothing factor. This leads to a faster computation, with essentially no impact on the accuracy of the models. The distance values produced by the models are then subjected to a test: if the distance of a test datum, the 1-gram distribution of a specific file, is closest to some model computed for a set of files of a certain type, the file is deemed of that type.
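The simplified distance takes only a few lines of code. The function below is our rendering of the formula; the default value of alpha is an arbitrary choice for the sketch:

    def simplified_mahalanobis(x, y_bar, sigma, alpha=1e-3):
        """D(x, y) = sum_i |x_i - y_i| / (sigma_i + alpha)."""
        return sum(abs(xi - yi) / (si + alpha)
                   for xi, yi, si in zip(x, y_bar, sigma))

    # Classification by the nearest model (fp is a Fileprint as sketched above):
    # file_type = min(models, key=lambda t: simplified_mahalanobis(
    #     test_histogram, models[t].mean,
    #     [models[t].sigma(i) for i in range(256)]))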
Alternatively, in some applications we may compute a distinct threshold setting, T_A, for each model M_A computed. If the distance between the test file and M_A is at or below T_A, the test file will be classified as type A. An initial value of T_A may simply be the maximum distance from the model to its own training data, plus some small constant, \epsilon.
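As a sketch (ours; the epsilon value is illustrative only):

    def initial_threshold(training_distances, epsilon=1.0):
        """T_A = max distance of model A to its own training files, plus epsilon."""
        return max(training_distances) + epsilon

    # accept_as_type_A = distance_to_model_A <= t_a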
Since the type of a test file is unknown, we need a precise context in order to build a set of models for a set of expected file types. Suppose we are interested in building a virus detector for some host system, such as a Windows client machine. That client may regularly produce or exchange MS Office documents, PDF files, compressed archives, photographs or image files, and raw text files. In this example, we would need to model probably about 10 representative file types expected for the client machine in question. These 10 models would serve to protect the machine by validating all files loaded or exchanged at run time. Recall, our goal is to ensure that a file claiming to be of type A actually matches the corresponding fileprint for A. For example, when receiving a file with extension .DOC that contains non-ASCII characters, it should be checked against the MS Word fileprint. In order to compute such models, we use the client's existing store of files as training data to compute the fileprints. We follow this strategy in the experiments described in the following sections. However, for some file types, we searched the internet using Google to prepare a set of "randomly chosen" representatives of a file type, to avoid any bias a single client machine may produce, and to provide the opportunity for other researchers to validate our results by accessing the same files that are also available to them.

C. Modeling and Testing Technique

In this section, we describe several strategies to improve the efficiency and accuracy of the technique: truncation, which reduces the amount of data modeled in each file, and multiple centroids computed via clustering, which model each file type at a finer grain.

1. Truncation

Truncation simply means we model only a fixed portion of a file when computing a byte distribution. That portion may be a fixed prefix, say the first 1000 bytes, or a fixed portion of the tail of a file, as well as perhaps a middle portion. This has several advantages (a code sketch follows the list):

- For most files, it can be assumed that the part of the file most relevant to its particular type is located early in the file, to allow quick loading of meta-data by the handler program that processes the file type. Truncation thus avoids analyzing a good deal of the payload of the file that is not relevant to distinguishing its type and that may be similar or identical across several different file types. (For example, the bulk of the data of compressed images, .JPG, may appear to have a similar distribution – a uniform byte value distribution – to that of encrypted files, such as .ZIP.)

- Truncation dramatically reduces the computing time for model building and file testing. In network applications this has obvious advantages. Only the first packet storing the prefix of the file need be analyzed, ignoring the stream of subsequent packets. If a file of size 100MB is transmitted over TCP, only 1 out of thousands of packets would therefore be processed.

We conjecture that truncated models can perform as well as whole-file models, and we present the results of experiments on both truncated and non-truncated files to test this conjecture.
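In code (our sketch), truncation amounts to computing the histogram over data[:k] rather than the whole file:

    def truncated_histogram(data: bytes, k: int = 1000):
        """1-gram frequency histogram over only the first k bytes of a file."""
        head = data[:k]
        counts = [0] * 256
        for b in head:
            counts[b] += 1
        total = len(head) or 1          # guard against empty input
        return [c / total for c in counts]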
2. Centroids

There are good reasons why some file types have similar distributions. Figure 4 compares the MS Office formats (Word, PowerPoint, and Excel). The formats are similar, and the technique presented in this paper would certainly not be sufficient to distinguish the different sub-types from one another. However, it may be the case that any one of the models, or all of them at once, can be used to distinguish any MS Office document from, say, a virus.
To compute multiple centroids for a single file type, we use the K-means algorithm under the Manhattan Distance. The Manhattan Distance is defined as follows. Given two files A and B, with byte frequency distributions A_i and B_i, i = 0, ..., 255, their Manhattan Distance is:

D(A, B) = \sum_{i=0}^{255} |A_i - B_i|.

The K-means algorithm that computes multiple centroids is briefly described as follows (a sketch in code appears after the steps):

1. Randomly pick K files from the training data set. These K files (their byte value frequency distributions) are the initial seeds for the first K centroids, each representing a cluster.
2. For each remaining file in the training set, compute the Manhattan Distance against the K selected centroids, and assign that file to the closest seed centroid.
3. Update the centroid byte value distribution with the distribution of the assigned file.
4. Repeat steps 2 and 3 for all remaining files, until the centroids stabilize without any further substantive change.

The result is a set of K centroid models, M_{kA}, which are later used in testing unknown files.
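The sketch below is our own rendering of these steps; note that the paper's steps 2-4 update each centroid online as files are assigned, whereas we use the common fixed-iteration batch variant, and K and the iteration count are illustrative choices:

    import random

    def manhattan(a, b):
        """D(A, B) = sum over all 256 byte values of |A_i - B_i|."""
        return sum(abs(x - y) for x, y in zip(a, b))

    def kmeans_centroids(histograms, k=3, iterations=20):
        """Multi-centroid fileprint: K-means over byte-frequency histograms."""
        centroids = random.sample(histograms, k)            # step 1: random seeds
        for _ in range(iterations):                         # steps 2-4
            clusters = [[] for _ in range(k)]
            for h in histograms:                            # assign to nearest
                j = min(range(k), key=lambda c: manhattan(h, centroids[c]))
                clusters[j].append(h)
            for j, members in enumerate(clusters):          # recompute centroid
                if members:
                    centroids[j] = [sum(col) / len(members)
                                    for col in zip(*members)]
        return centroids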
III. FILE TYPE CLASSIFICATION AND ANOMALY DETECTION: EXPERIMENTAL RESULTS

In this section we describe several experiments to test whether fileprints accurately classify unknown files or validate the presumed type of a file. The experiments performed include tests of single models computed for all files of a single type, multiple models computed over the same type, and models computed under truncation of files. We report the average accuracy over 100 trials using cross-validation for each of the modeling techniques.

A. File Sets

We used 2 groups of data sets. The first data set includes 8 different file types: EXE, DLL, GIF, JPG, PDF, PPT, DOC and XLS. Models for each were computed using 100 files of each type. The files were collected from the internet using a general search on Google. For example, .PDF files were collected from Google by using the search term ".pdf". In this way, the files can be considered randomly chosen, as an unbiased sample. (The reader can also repeat our experiments, since the same files may be available through the same simple search.)

In our earlier experiment, we found that EXE and DLL files have essentially the exact same header information and extremely similar 1-gram distributions. They are used for similar purposes in MS systems, so we consider them to be in the same class in this paper. The contents of the MS Office file types are also similar (see Figure 4). They have the same header, which is "D0 CF 11 E0 A1 B1 1A E1". We thus assign all DOC, PPT and XLS files to a single class, represented by .DOC in the figures below.

The files vary in size; each is approximately from 10KB to 1MB long. To avoid a problem of sample bias, we only compare files of similar size in the following experiments. For example, a 100KB file can be compared to a 200KB file, but not to a 1MB file.

B. File Type Classification

1. One-centroid file type model accuracy

In this section, we seek to determine how well each fileprint identifies files of its own type, using both the entire and the truncated content of the files in question.

For each file type x, we generated a single (one-centroid) model M_x. For example, we computed M_pdf by using 80% of the collected PDF files. Since we had 8 different types of files (EXE and DLL are considered as one type, and DOC, PPT, and XLS are in one group), we generated 5 models in total. The remaining 20% of the files of each type were used as the test data.

In the truncated cases, we modeled and tested the first 20, 200, 500 and 1000 bytes of each file. This portion includes the "magic numbers" of the file type, if they exist. Such analysis was used to establish a baseline and to determine whether all the files tested contain essentially common header information.

The results are quite amazing. There were only a few misclassified files when we used truncated files. The classification accuracy results are shown in the top row of Table 1. In the rows for 20 and 200 bytes, the results are almost perfect. There are some common problems. First, the image types, GIF and JPG, are sometimes similar. Second, document files (PDF and MS Office types) may include images; these may also cause misclassification errors. Last, PDF files (with or without images) may be classified into the GIF category.

In cases where file boundaries are easily identifiable, it is rather straightforward to identify the file type from header information alone. This serves as a baseline and a first level of detection that should work well in practice. However, we next turn our attention to the more general case where header information is damaged, missing, or purposely replaced to avoid detection of the true file type. We thus extend the analysis to the entire file content.

Table 1. One-centroid file type classification accuracy, by truncation size (columns: Truncation, EXE, GIF, JPG, PDF, DOC, AVG.)
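For concreteness, the one-centroid experiment can be outlined as follows (our sketch; histogram, build_model and distance are hypothetical helpers standing in for the pieces sketched earlier, and the 80/20 split mirrors the description above):

    def one_centroid_accuracy(files_by_type, histogram, build_model, distance):
        """Train one model per type on 80% of its files; classify the held-out
        20% by the nearest model; return the fraction classified correctly."""
        models, tests = {}, {}
        for ftype, files in files_by_type.items():
            cut = int(0.8 * len(files))
            models[ftype] = build_model(histogram(f) for f in files[:cut])
            tests[ftype] = files[cut:]
        correct = total = 0
        for true_type, files in tests.items():
            for f in files:
                guess = min(models,
                            key=lambda t: distance(histogram(f), models[t]))
                correct += (guess == true_type)
                total += 1
        return correct / total if total else 0.0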
2. Multi-centroid file type model accuracy

We generated K models for each of the types of files: M_k_exe, M_k_doc, M_k_pdf, for example. If K = 10, a total of 50 models (for the 5 groups of test files) are tested using the Mahalanobis Distance to determine the closest file type model. The results are shown in the middle row of Table 1.

Comparing each of these multi-centroid results to the previous one-centroid case, the results are better. We also tested several sizes of K; the results are basically similar.

3. Exemplar files used as centroids

We may extend the multi-centroid method without using K-means. In this experiment we test each file against the distributions of a randomly chosen set of exemplar files. The same technique was used as described in the previous tests, but here we randomly choose 80% of the files as the representative samples of their file type. The other 20% of the files are the test files. In this case we compare the 1-gram distribution of each test file against those of the exemplar files.
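A sketch of this exemplar-based variant (ours): keep every representative histogram, and assign a test file the type of its nearest exemplar, in nearest-neighbor fashion:

    def classify_by_exemplars(x, exemplars, distance):
        """exemplars: list of (histogram, file_type) pairs for the 80%
        representative files; returns the type of the closest exemplar."""
        _hist, ftype = min(exemplars, key=lambda e: distance(x, e[0]))
        return ftype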
IV. CONCLUSION

In this paper, we demonstrate the 1-gram binary distributions of files for different file types. The experiments demonstrate that every file type has a distinctive distribution that we regard as a "fileprint". This observation is important. The centroid models representing the byte value distributions of a set of training files can be an effective tool in a number of applications, including the detection of security policy violations. Techniques that may be used by attackers to hide their malware from signature-based systems will have a tougher time remaining stealthy and avoiding detection under these techniques.

Moreover, we found that the truncated modeling technique performs as well as, if not better than, modeling whole files, with superior computational performance. This implies that real-time network-based detectors that accurately identify the type of files flowing over a network are achievable at reasonable cost.

As future work, a number of interesting alternative …
V. ACKNOWLEDGEMENTS
VI. REFERENCES