Fileprints: Identifying File Types by n-gram Analysis
Abstract – We propose a method to analyze files to categorize their type using efficient 1-gram analysis of their binary contents. Our aim is to accurately identify the true type of an arbitrary file using statistical analysis of its binary contents, without parsing. Consequently, we may determine the type of a file even if its name does not announce its true type. The method represents each file type by a compact representation we call a fileprint, effectively a simple means of representing all members of the same file type by a set of statistical 1-gram models. The method is designed to be highly efficient, so that files can be inspected with little or no buffering, whether on a network appliance operating in a high-bandwidth environment or when streaming the file from or to disk.

I. INTRODUCTION

Files typically follow naming conventions that use standard extensions describing their type or the applications used to open and process them. However, although a file may be named Paper.doc¹, it may not be a legitimate Word document file unless it is successfully opened and displayed by Microsoft Word, or parsed and checked by tools, such as the Unix file command, if such tools exist for the file type in question.

¹ For our purposes here, we refer to .DOC as Microsoft Word documents, although other applications use the .DOC extension, such as Adobe FrameMaker, Interleaf Document Format, and Palm Pilot format, to name a few.

The Unix file command performs several syntactic checks of a file to determine whether it is text or binary executable; otherwise it is deemed data, a "catch all" for just about anything. These tests include checking header information, for example for "magic numbers", to identify how the file was created. One test performed by file considers whether the bulk of byte values are printable ASCII characters, and hence such files are deemed text files.

The magic numbers serve as a "signature" to identify the file type, such as in the case of .PDF files, where the header contains "25 50 44 46 2D 31 2E". However, a pure signature-based (or string-compare) file type analyzer [1, 2] runs several risks. Not all file types have such magic numbers. The beginning of the file could be damaged, or purposely missing if obfuscation is used. For example, malicious code can be hidden by many techniques, such as the use of binders, packers or code obfuscation [3, 4]. In the case of network traffic analysis, due to packet fragmentation, the beginning portion of a file may not be entirely contained in a single packet datagram, or it may be purposely padded and intermixed with other data in the same packet to avoid signature-based detection. Finally, not all file types have a distinct "magic number".

In this paper we propose a method to analyze the contents of exemplar files using efficient statistical modeling techniques. In particular, we apply n-gram analysis to the binary content of a set of exemplar "training" files and produce normalized n-gram distributions representing all files of a specific type. Our aim is to determine the validity of files claiming to be of a certain type (even though the header may indicate a certain file type, the actual content may not be what is claimed) or to determine the type of an unnamed file object. In our prior work, we exploited this modeling technique in network packet content analysis for zero-day worm detection [5]. We extend that work here for checking file types, whether in network traffic flows or on disk. In that prior work we generated many models conditioned on port/service and length of payload, yielding a set of models that represents normal data flow very accurately and identifies different data quite accurately.

McDaniel and Heydari [6] introduced algorithms for generating "fingerprints" of file types using byte-value distributions of file content, which is very similar in spirit to the work reported here. There are, however, several important differences in our work. First, they compute a single representative fingerprint for the entire class of file types. Our work demonstrates that it is very difficult to produce one single descriptive model that accurately represents all members of a single file type class; their reported experimental results also demonstrate this. Hence, we introduce the idea of computing a set of centroid models and use clustering to find a minimal set of centroids with good performance. Furthermore, the McDaniel paper describes tests using 120 files divided among 30 types, with 4 files used to compute a model for each type. We have discovered that files within the same type may vary greatly (especially documents with embedded objects such as images), and hence so few exemplars may yield poor performance.
In prior work, a supervised training strategy was applied to model known malicious code and detect members of that class in email attachments. The n-gram distributions were used as input to a supervised Naïve Bayes machine learning algorithm to compute a single classifier of "malicious file content". In this work, we extend these techniques by calculating the entire 1-gram distributions of file content and using these as models for a set of file types of interest. The distribution of byte values of a file is compared to the models using well-known statistical distribution distance measures. Here, we restrict our attention to only two, Mahalanobis and Manhattan distance.

For concreteness, suppose we have a file, F, of unknown type. In general, to distinguish between file types A and B, we compute two models, M_A and M_B, corresponding to file types A and B, respectively. To test the single file F, we check the distribution of its content against both models, and see which one it most closely matches. We assert its name as F.B if the contents of F closely match model M_B, i.e., D(\delta(F.B), M_B) is less than some threshold, where \delta refers to the 1-gram distribution (a histogram), and D is a distance function.

Alternatively, given file F.A, we may check its contents against M_A, and if the 1-gram distribution of F.A is too distant from the model M_A, we may assert that F.A is anomalous and hence misnamed; that is, D(\delta(F.A), M_A) > T_A for some preset threshold T_A. We may then suspect that F.A is infected with foreign content and subject it to further tests, for example, to determine whether it has embedded exploit code.

In this paper, we test whether we can accurately classify file types. This test is used to corroborate our thesis that file types have a regular 1-gram representation. We apply a test to 800 normal files with 8 different extensions. We compute a set of fileprints (or "centroid" models) for each of the 8 distinct types, and test a set of files for correct classification by those models. Ground truth is known, so accurate measurements are possible.

Computing multiple centroids better captures the variability of the training file type distributions, and hence may provide a more accurate fileprint classifier with fewer false positives. We extend this strategy to the extreme case. Rather than computing cluster centroids, we consider a set of exemplar files of a certain type as the fileprint. Each test file is compared to this set of pre-assigned exemplar files. The performance results of each of these tests show remarkably good results. We also perform these same tests using different portions of the files, a strategy we call truncation.

In Section II we briefly describe n-gram analysis and give an overview of the modeling techniques used in this study. Section III details the data sets and the experimental results. Section IV concludes the paper.

II. FILEPRINTS

A. n-gram Analysis

Before demonstrating and graphically plotting the fileprints of file contents, we first introduce n-gram analysis. An n-gram [9] is a subsequence of N consecutive tokens in a stream of tokens. n-gram analysis has been applied in many tasks, and is well understood and efficient to implement.

By converting a string of data to n-grams, it can be embedded in a vector space to efficiently compare two or more streams of data. Alternatively, we may compare the distributions of n-grams contained in a set of data to determine how consistent some new data may be with the set of data in question.

An n-gram distribution is computed by sliding a fixed-size window through the set of data and counting the number of occurrences of each "gram". Figure 2 displays an example of a 3-byte window sliding right one byte at a time to generate each 3-gram. Each 3-gram is displayed in the highlighted "window".
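As an illustration of this sliding-window computation, the short sketch below is ours (not from the paper; the names are arbitrary) and counts n-grams exactly as described:

    from collections import Counter

    def ngram_counts(data: bytes, n: int = 3) -> Counter:
        """Slide an n-byte window one byte at a time and count each gram."""
        counts = Counter()
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1  # the gram in the current window
        return counts

    # The 3-grams of b"ABCDE" are b"ABC", b"BCD", b"CDE"
    print(ngram_counts(b"ABCDE", 3))

Normalizing the counts by the total number of grams yields the n-gram distribution used in the remainder of the paper.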
Since each token can take one of 256 byte values, there are 256^3 possible 3-grams. A string of M letters would thus have (M - 2) 3-grams, with a distribution that is quite sparse.

In this work, we focus initially on 1-gram analysis of ASCII byte values. Hence, a single file is represented as a 256-element histogram. This is a highly compact and efficient representation, but it may not have sufficient resolution to represent a class of file types. The results of our experiments indicate that 1-grams indeed perform well enough, without providing sufficient cause to consider higher-order grams.

B. Mahalanobis Distance

Given a training data set of files of type A, we compute a model M_A. For each specific observed file, M_A stores the average byte frequency and the standard deviation of each byte's frequency.

Note that the byte value mean frequency, \bar{x}, may be computed in real time during training as an incremental function:

\bar{x}' = \frac{\bar{x} \times N + x_{N+1}}{N+1} = \bar{x} + \frac{x_{N+1} - \bar{x}}{N+1},

and similarly for the computation of the standard deviation. Hence, the models may be computed very efficiently while streaming the data, without the need to fully buffer the file.
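A minimal sketch of this streaming model computation follows (our illustration, not the authors' code): the mean uses the incremental update above, and for the standard deviation we substitute Welford's running-variance update, one standard way to realize the "similarly":

    class Fileprint:
        """Per-byte mean and standard deviation of frequency, updated online."""

        def __init__(self):
            self.n = 0
            self.mean = [0.0] * 256  # running mean frequency per byte value
            self.m2 = [0.0] * 256    # running sum of squared deviations

        def update(self, freq):
            """Fold in one file's 256-bin frequency histogram:
            mean' = mean + (x_{N+1} - mean) / (N + 1)."""
            self.n += 1
            for i in range(256):
                delta = freq[i] - self.mean[i]
                self.mean[i] += delta / self.n                   # incremental mean
                self.m2[i] += delta * (freq[i] - self.mean[i])   # Welford variance

        def sigma(self, i):
            return (self.m2[i] / self.n) ** 0.5 if self.n else 0.0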
Once a model has been computed, we next consider the comparison of a test file against this model, either to validate the file's purported type, or to assign a type to a file of unknown origin. We use the Mahalanobis Distance for this purpose. The Mahalanobis Distance weights each variable, the mean frequency of a 1-gram, by its standard deviation and covariance. The computed distance value is a statistical measure of how well the distribution of the test example matches (or is consistent with) the training samples, i.e., the normal content modeled by the centroids. If we assume each byte value is statistically independent, we can simplify the Mahalanobis Distance as:

D(x, \bar{y}) = \sum_{i=0}^{n-1} \frac{|x_i - \bar{y}_i|}{\sigma_i + \alpha},

where x is the feature vector of the new observation, \bar{y} is the averaged feature vector computed from the training examples, \sigma_i is the standard deviation, and \alpha is a smoothing factor. This leads to a faster computation, with essentially no impact on the accuracy of the models. The distance values produced by the models are then subjected to a test: if the distance of a test datum, the 1-gram distribution of a specific file, is closest to some model computed for a set of files of a certain type, the file is deemed of that type.
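The simplified distance takes only a few lines of code. The function below is our rendering of the formula; the default value of alpha is an arbitrary choice for the sketch:

    def simplified_mahalanobis(x, y_bar, sigma, alpha=1e-3):
        """D(x, y) = sum_i |x_i - y_i| / (sigma_i + alpha)."""
        return sum(abs(xi - yi) / (si + alpha)
                   for xi, yi, si in zip(x, y_bar, sigma))

    # Classification by the nearest model (fp is a Fileprint as sketched above):
    # file_type = min(models, key=lambda t: simplified_mahalanobis(
    #     test_histogram, models[t].mean,
    #     [models[t].sigma(i) for i in range(256)]))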
Alternatively, in some applications we may compute a distinct threshold setting, T_A, for each model M_A computed. If the distance between the test file and M_A is at or below T_A, the test file will be classified as type A. An initial value of T_A may simply be the maximum distance from the model to its own training data, plus some small constant, \epsilon.
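As a sketch (ours; the epsilon value is illustrative only):

    def initial_threshold(training_distances, epsilon=1.0):
        """T_A = max distance of model A to its own training files, plus epsilon."""
        return max(training_distances) + epsilon

    # accept_as_type_A = distance_to_model_A <= t_a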
Since the type of a test file is unknown, we need a precise context in order to build a set of models for a set of expected file types. Suppose we are interested in building a virus detector for some host system, such as a Windows client machine. That client may regularly produce or exchange MS Office documents, PDF files, compressed archives, photographs or image files, and raw text files. In this example, we would need to model probably about 10 representative file types expected for the client machine in question. These 10 models would serve to protect the machine by validating all files loaded or exchanged at run time. Recall, our goal is to ensure that a file claiming to be of type A actually matches the corresponding fileprint for A. For example, when receiving a file with extension .DOC that contains non-ASCII characters, it should be checked against the MS Word fileprint. In order to compute such models, we use the client's existing store of files as training data to compute the fileprints. We follow this strategy in the experiments described in the following sections. However, for some file types, we searched the internet using Google to prepare a set of "randomly chosen" representatives of a file type, to avoid any bias a single client machine may produce, and to provide the opportunity for other researchers to validate our results by accessing the same files that are also available to them.

C. Modeling and Testing Technique

In this section, we describe several strategies to improve the efficiency and accuracy of the technique: truncation, which reduces the amount of data modeled in each file, and multiple centroids computed via clustering, which model each file type at a finer grain.

1. Truncation

Truncation simply means we model only a fixed portion of a file when computing a byte distribution. That portion may be a fixed prefix, say the first 1000 bytes, or a fixed portion of the tail of a file, as well as perhaps a middle portion. This has several advantages (a code sketch follows the list):

- For most files, it can be assumed that the part of the file most relevant to its particular type is located early in the file, to allow quick loading of meta-data by the handler program that processes the file type. Truncation thus avoids analyzing a good deal of the payload of the file that is not relevant to distinguishing its type and that may be similar or identical across several different file types. (For example, the bulk of the data of compressed images, .JPG, may appear to have a similar distribution – a uniform byte value distribution – to that of encrypted files, such as .ZIP.)

- Truncation dramatically reduces the computing time for model building and file testing. In network applications this has obvious advantages. Only the first packet storing the prefix of the file need be analyzed, ignoring the stream of subsequent packets. If a file of size 100MB is transmitted over TCP, only 1 out of thousands of packets would therefore be processed.

We conjecture that truncated models can perform as well as whole-file models, and we present the results of experiments on both truncated and non-truncated files to test this conjecture.
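In code (our sketch), truncation amounts to computing the histogram over data[:k] rather than the whole file:

    def truncated_histogram(data: bytes, k: int = 1000):
        """1-gram frequency histogram over only the first k bytes of a file."""
        head = data[:k]
        counts = [0] * 256
        for b in head:
            counts[b] += 1
        total = len(head) or 1          # guard against empty input
        return [c / total for c in counts]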
2. Centroids

There are good reasons why some file types have similar distributions. Figure 4 compares the MS Office formats (Word, PowerPoint, and Excel). The formats are similar, and the technique presented in this paper would certainly not be sufficient to distinguish the different sub-types from one another. However, it may be the case that any one of the models, or all of them at once, can be used to distinguish any MS Office document from, say, a virus.
To compute multiple centroids for a single file type, we use the K-means algorithm under the Manhattan Distance. The Manhattan Distance is defined as follows. Given two files A and B, with byte frequency distributions A_i and B_i, i = 0, ..., 255, their Manhattan Distance is:

D(A, B) = \sum_{i=0}^{255} |A_i - B_i|.

The K-means algorithm that computes multiple centroids is briefly described as follows (a sketch in code appears after the steps):

1. Randomly pick K files from the training data set. These K files (their byte value frequency distributions) are the initial seeds for the first K centroids, each representing a cluster.
2. For each remaining file in the training set, compute the Manhattan Distance against the K selected centroids, and assign that file to the closest seed centroid.
3. Update the centroid byte value distribution with the distribution of the assigned file.
4. Repeat steps 2 and 3 for all remaining files, until the centroids stabilize without any further substantive change.

The result is a set of K centroid models, M_{kA}, which are later used in testing unknown files.
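The sketch below is our own rendering of these steps; note that the paper's steps 2-4 update each centroid online as files are assigned, whereas we use the common fixed-iteration batch variant, and K and the iteration count are illustrative choices:

    import random

    def manhattan(a, b):
        """D(A, B) = sum over all 256 byte values of |A_i - B_i|."""
        return sum(abs(x - y) for x, y in zip(a, b))

    def kmeans_centroids(histograms, k=3, iterations=20):
        """Multi-centroid fileprint: K-means over byte-frequency histograms."""
        centroids = random.sample(histograms, k)            # step 1: random seeds
        for _ in range(iterations):                         # steps 2-4
            clusters = [[] for _ in range(k)]
            for h in histograms:                            # assign to nearest
                j = min(range(k), key=lambda c: manhattan(h, centroids[c]))
                clusters[j].append(h)
            for j, members in enumerate(clusters):          # recompute centroid
                if members:
                    centroids[j] = [sum(col) / len(members)
                                    for col in zip(*members)]
        return centroids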
III. FILE TYPE CLASSIFICATION AND ANOMALY DETECTION: EXPERIMENTAL RESULTS

In this section we describe several experiments to test whether fileprints accurately classify unknown files or validate the presumed type of a file. The experiments performed include tests of single models computed for all files of a single type, multiple models computed over the same type, and models computed under truncation of files. We report the average accuracy over 100 trials using cross-validation for each of the modeling techniques.

A. File Sets

We used 2 groups of data sets. The first data set includes 8 different file types: EXE, DLL, GIF, JPG, PDF, PPT, DOC and XLS. Models for each were computed using 100 files of each type. The files were collected from the internet using a general search on Google. For example, .PDF files were collected from Google by using the search term ".pdf". In this way, the files can be considered randomly chosen, as an unbiased sample. (The reader can also repeat our experiments, since the same files may be available through the same simple search.)

In our earlier experiment, we found that EXE and DLL files have essentially the exact same header information and extremely similar 1-gram distributions. They are used for similar purposes in MS systems, so we consider them to be in the same class in this paper. The contents of the MS Office file types are also similar (see Figure 4). They have the same header, which is "D0 CF 11 E0 A1 B1 1A E1". We thus assign all DOC, PPT and XLS files to a single class, represented by .DOC in the figures below.

The files vary in size; each is approximately from 10KB to 1MB long. To avoid a problem of sample bias, we only compare files of similar size in the following experiments. For example, a 100KB file can be compared to a 200KB file, but not to a 1MB file.

B. File Type Classification

1. One-centroid file type model accuracy

In this section, we seek to determine how well each fileprint identifies files of its own type, using both the entire and the truncated content of the files in question.

For each file type x, we generated a single (one-centroid) model M_x. For example, we computed M_pdf by using 80% of the collected PDF files. Since we had 8 different types of files (EXE and DLL are considered as one type, and DOC, PPT, and XLS are in one group), we generated 5 models in total. The remaining 20% of the files of each type were used as the test data.

In the truncated cases, we modeled and tested the first 20, 200, 500 and 1000 bytes of each file. This portion includes the "magic numbers" of the file type, if they exist. Such analysis was used to establish a baseline and to determine whether all the files tested contain essentially common header information.

The results are quite amazing. There were only a few misclassified files when we used truncated files. The classification accuracy results are shown in the top row of Table 1. In the rows for 20 and 200 bytes, the results are almost perfect. There are some common problems. First, the image types, GIF and JPG, are sometimes similar. Second, document files (PDF and MS Office types) may include images; these may also cause misclassification errors. Last, PDF files (with or without images) may be classified into the GIF category.

In cases where file boundaries are easily identifiable, it is rather straightforward to identify the file type from header information alone. This serves as a baseline and a first level of detection that should work well in practice. However, we next turn our attention to the more general case where header information is damaged, missing, or purposely replaced to avoid detection of the true file type. We thus extend the analysis to the entire file content.

Table 1. One-centroid file type classification accuracy, by truncation size (columns: Truncation, EXE, GIF, JPG, PDF, DOC, AVG.)
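For concreteness, the one-centroid experiment can be outlined as follows (our sketch; histogram, build_model and distance are hypothetical helpers standing in for the pieces sketched earlier, and the 80/20 split mirrors the description above):

    def one_centroid_accuracy(files_by_type, histogram, build_model, distance):
        """Train one model per type on 80% of its files; classify the held-out
        20% by the nearest model; return the fraction classified correctly."""
        models, tests = {}, {}
        for ftype, files in files_by_type.items():
            cut = int(0.8 * len(files))
            models[ftype] = build_model(histogram(f) for f in files[:cut])
            tests[ftype] = files[cut:]
        correct = total = 0
        for true_type, files in tests.items():
            for f in files:
                guess = min(models,
                            key=lambda t: distance(histogram(f), models[t]))
                correct += (guess == true_type)
                total += 1
        return correct / total if total else 0.0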
2. Multi-centroid file type model accuracy

We generated K models for each of the types of files: M_k_exe, M_k_doc, M_k_pdf, for example. If K = 10, a total of 50 models (for the 5 groups of test files) are tested using the Mahalanobis Distance to determine the closest file type model. The results are shown in the middle row of Table 1.

Comparing each of these multi-centroid results to the previous one-centroid case, the results are better. We also tested several sizes of K; the results are basically similar.

3. Exemplar files used as centroids

We may extend the multi-centroid method without using K-means. In this experiment we test each file against the distributions of a randomly chosen set of exemplar files. The same technique was used as described in the previous tests, but here we randomly choose 80% of the files as the representative samples of their file type. The other 20% of the files are the test files. In this case we compare the 1-gram distribution of each test file against those of the exemplar files.
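A sketch of this exemplar-based variant (ours): keep every representative histogram, and assign a test file the type of its nearest exemplar, in nearest-neighbor fashion:

    def classify_by_exemplars(x, exemplars, distance):
        """exemplars: list of (histogram, file_type) pairs for the 80%
        representative files; returns the type of the closest exemplar."""
        _hist, ftype = min(exemplars, key=lambda e: distance(x, e[0]))
        return ftype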
IV. CONCLUSION

In this paper, we demonstrate the 1-gram binary distributions of files for different file types. The experiments demonstrate that every file type has a distinctive distribution that we regard as a "fileprint". This observation is important. The centroid models representing the byte value distributions of a set of training files can be an effective tool in a number of applications, including the detection of security policy violations. Techniques that may be used by attackers to hide their malware from signature-based systems will have a tougher time remaining stealthy and avoiding detection under these techniques.

Moreover, we found that the truncated modeling technique performs as well as, if not better than, modeling whole files, with superior computational performance. This implies that real-time network-based detectors that accurately identify the type of files flowing over a network are achievable at reasonable cost.

As future work, a number of interesting alternative …
V. ACKNOWLEDGEMENTS
VI. REFERENCES