Proceedings of the 36th Hawaii International Conference on System Sciences - 2003

Content Based File Type Detection Algorithms


Mason McDaniel and M. Hossain Heydari¹
Computer Science Department
James Madison University
Harrisonburg, VA 22802

Abstract

Identifying the true type of a computer file can be a difficult problem. Previous methods of file type recognition include fixed file extensions, fixed “magic numbers” stored with the files, and proprietary descriptive file wrappers. All of these methods have significant limitations. This paper proposes algorithms for automatically generating “fingerprints” of file types based on a set of known input files, then using the fingerprints to recognize the true type of unknown files based on their content, rather than metadata associated with them. Recognition is performed by three different algorithms based on: byte frequency analysis, byte frequency cross-correlation analysis, and file header/trailer analysis. Tests were run to measure the accuracy of these algorithms. The accuracy varied from 23% to 96% depending upon which algorithm was used.

These algorithms could be used by virus scanning packages, firewalls, intrusion detection systems, forensic analyses of computer hard drives, web browsers, or any other program that needs to identify the types of files for proper operation. File type detection is also important to the operating systems for correct identification and handling of files regardless of file extension.

I. Introduction

Computers use a tremendous array of file formats today. All types of files are frequently transmitted through intranets and the Internet. Currently, operating systems, firewalls, and intrusion detection systems have very few methods for determining the true type of a file. Perhaps the most common method used is to identify the type of a file by the file’s extension. This is an extremely unreliable method, as any user or application can change a file’s name and extension at any time. As a result, some users are able to conceal files from system administrators simply by renaming them to a filename with a different extension. While this doesn’t conceal the existence of a file, it can conceal the nature of a file and can prevent it from being opened by the operating system. In addition, many virus-scanning packages default to only scanning executable files. These packages may miss any viruses contained within executable files that had non-executable file extensions. This could introduce vulnerabilities into a network, even if it contained virus protection.

The other common method of identifying file types is through manual definition of file recognition rules. This is an extremely time-consuming process, whereby an individual examines a file type specification, if one is available, and identifies consistent features of a file type that can be used as a unique identifier of that type. In the absence of a specification, the individual must manually examine a number of files looking for common features that can be used to identify the file type. Not only is this time-consuming, but it can require an individual with a highly technical background that is capable of doing a hexadecimal analysis of files.

Manual rule definition is the method used by many Unix-based operating systems, as well as tools used in forensic analysis of computer disks during investigations. Regardless of the investigating authority, automated file type recognition is a critical part of this sort of computer forensic analysis.

An efficient, automated algorithm to perform this kind of file type recognition would be of tremendous benefit to organizations needing to perform forensic

¹ This research is supported in part by a grant from the Virginia Commonwealth Technology Research Fund and supported in part by a grant from the Department of Defense Information Assurance Scholarship program.

0-7695-1874-5/03 $17.00 (C) 2003 IEEE


analyses of computer hard drives. It could also be used by virus protection software, intrusion detection systems, firewalls, web browsers, and security downgrading packages to identify the true nature of programs passing through the protected systems. Finally, this kind of algorithm could be of use to the operating systems themselves to allow for correct identification and handling of files regardless of file extension.

This paper describes an attempt to extend the concept of frequency analysis and apply it to the general case of generating a characteristic “fingerprint” for different computer file types, and subsequently using the fingerprint to identify file types based upon their characteristic signatures. The process could be almost entirely automated, and would not be affected if a user changed a file name or extension.

I.1 Previous work

To date, there have been relatively few methods for identifying the type of a file. One of the most commonly used methods is the use of file extensions. Microsoft’s operating systems use this method almost exclusively. They come preset with associations between file extensions and file types. If different associations are desired, they must be manually reconfigured by the user [7]. As mentioned above, this approach introduces many security vulnerabilities. A user can change the extension of a file at any time, rendering the operating system unable to identify it. They can also change the file extension associations to fool the operating system into handling files in an inappropriate manner, such as trying to execute a text file.

Another approach is that taken by many Unix-based operating systems. These make use of a “magic number” which consists of the first 16 bits of each file. A file, such as /etc/magic then associates magic numbers with file types [9]. This approach has a number of drawbacks as well. The magic numbers must be predefined before the files are generated, and are then built into the files themselves. This makes it very difficult to change them over time, since a change might interfere with the proper operation of many files that were generated using the old magic number. Furthermore, not all file types use magic numbers. The scheme was initially intended to assist with the proper handling of executable and binary formats. With only 16 bits allocated, a number of extensions had to be introduced over time, such as using the “#!” magic number to identify a command to execute on the rest of the file [8].

Another approach is to define a proprietary file format that encapsulates other files and provides information regarding their type. One example of this approach is the Standard File Format (SAF) developed by the Advanced Missile Signature Center (AMSC) [6]. There are many down sides to this approach. The specification must be written defining how to encapsulate and identify each file format. An individual or external system must identify the type of the file before it can be correctly encapsulated in the standard format in the correct manner. The most significant problem, however, is that this type of file can only be used within the small proprietary system that recognizes the “standard” format. The files cannot be exported to external systems such as the Internet without removing the encapsulation, and thus negating its benefit.

II. Algorithms

The design goals for the proposed file recognition algorithm are as follows:
• Accuracy – The algorithm should be as accurate as possible at identifying file types.
• Automatic generation of file type fingerprints.
• Small fingerprint files – The fingerprint file sizes should be minimized.
• Speed – Comparisons should be as fast as possible for a given fingerprint file size.
• Flexibility – The algorithm should provide a customizable tradeoff between speed and accuracy.
• Independence from file size.

These design goals can be achieved by implementing the three algorithms described in this paper, each of which could be selected independently, or used together for increased accuracy. Due to space limitation, detailed explanation of these results is available in [1].

II.1 Byte frequency analysis (BFA) algorithm

A computer file is a collection of bytes, which correspond to eight-bit numbers capable of representing numeric values from 0 to 255 inclusive. By counting the number of occurrences of each byte value in a file, a frequency distribution can be obtained. Many file types have consistent patterns to their frequency distributions, providing information useful for identifying the type of unknown files. Figure II.1 and Figure II.2 show the frequency distributions for a typical RichText (RTF) and a Graphics Interchange Format (GIF) file, respectively. Many file types likewise have characteristic patterns that can be used to differentiate them from other file formats.

This section describes the methods used to build the byte frequency distribution of individual files and

to construct a fingerprint representative of the file type.

II.1.1 Building the byte frequency distribution

The first step in building a byte frequency fingerprint is to count the number of occurrences of each byte value for a single input file. This is done by constructing an array of size 256 (indexed 0 to 255), and initializing all array locations to zero. For each byte in the file, the appropriate element of the array is incremented by one. Once the number of occurrences of each byte value is obtained, each element in the array is divided by the number of occurrences of the most frequent byte value. This normalizes the array to frequencies in the range of 0 to 1, inclusive. This normalization step prevents one very large file from skewing the file type fingerprint. Rather, each input file is provided equal weight regardless of size.

Some file types have some byte values that occur much more frequently than any other. If this happens, the normalized frequency distribution may show a large spike at the common values. Figure II.3 shows the frequency distribution for an executable file that demonstrates this. The file has large regions filled with the byte value zero. The resulting graph has a large spike at byte value zero, with insufficient detail to determine patterns in the remaining byte value ranges.

Figure II.1 - Byte frequency distributions for two RTF files.

A way to solve this problem would be to pass the frequency distribution through a companding function to emphasize the lower values. Common companding functions, such as the A-law and µ-law companding functions used in telecommunications [2], can be roughly approximated by the following function, which can be very rapidly computed.

The same file shown in Figure II.3, after being passed through this equation, produces the frequency distribution shown in Figure II.5. This graph shows more of the detail across all byte frequencies, and therefore may allow for more accurate comparisons. Experimental results indicated that β = 1.5 is the optimal β value for the most accurate file type recognition [1]. The optimal value of β is defined as the value that produces the greatest difference between the fingerprint with the highest frequency score and the fingerprint with the second-highest frequency score.

Figure II.2 - Byte frequency distributions for two GIF files.

Figure II.3 - Frequency distribution for a sample executable file.

The companding function results in a frequency distribution that is still normalized to 1. This is true since the most frequent byte value was normalized to 1, and the companding function with an input value of 1 results in an output value of 1.

II.1.2 Combining frequency distributions into a fingerprint

A fingerprint is generated by averaging the results of multiple files of a common file type into a single fingerprint file that is representative of the file type as a whole. To add a new file’s frequency distribution to a fingerprint we use the following simple averaging equation, where NFPS is the new fingerprint score, OFPS is the old fingerprint score,

PNF is the previous number of files, and NFS is the new file score.

$$NFPS = \frac{(OFPS \times PNF) + NFS}{PNF + 1}$$

Aside from the byte frequency distributions, there is another related piece of information that can be used to refine the comparisons. The frequencies of some byte values are very consistent between files of some file types, while other byte values vary widely in frequency. For example, note that almost all of the data in the files shown in Figure II.1 lie between byte values 32 and 126, corresponding to printable characters in the lower ASCII range. This is characteristic of the RichText format. On the other hand, the data within the byte value range corresponding to the ASCII English alphanumeric characters varies widely from file to file, depending upon the contents of the file.

This suggests that a “correlation strength” between the same byte values in different files can be measured, and used as part of the fingerprint for the byte frequency analysis. In other words, if a byte value always occurs with a regular frequency for a given file type, then this is an important feature of the file type, and is useful in file type identification.

A correlation factor can be calculated by comparing each file to the frequency scores in the fingerprint. The correlation factors can then be combined into an overall correlation strength score for each byte value of the frequency distribution.

The correlation factor of each byte value for an input file is calculated by taking the difference between that byte value’s frequency score from the input file and the frequency score from the fingerprint. If the difference between the two frequency scores is very small, then the correlation strength should increase toward 1. If the difference is large, then the correlation strength should decrease toward 0. Therefore, if a byte value always occurs with exactly the same frequency, the correlation strength should be 1. If the byte value occurs with widely varying frequencies in the input files, then the correlation strength should be nearly 0.

A function that would provide more tolerance for small variations and less tolerance for larger variations is a bell curve with a peak magnitude of 1 and the peak located at 0 on the horizontal axis. The general equation for this type of bell curve is:

$$F(x) = e^{-x^2 / (2\sigma^2)}$$

where F(x) is the correlation factor and x is the difference between the new byte value frequency and the average byte value frequency in the fingerprint. Experimental results indicated that σ = 0.0375 is the optimal σ value for the most accurate file type recognition [1].

Figure II.5 - Frequency distribution for the Figure II.3 file after passing through the companding function.

Once the input file’s correlation factor for each byte value is obtained, these values need to be combined with the correlation strengths in the fingerprint. This is accomplished by using the following simple averaging equation, which directly parallels the method used to calculate the frequency distribution scores, where NCS is the new correlation strength, OCS is the old correlation strength, PNF is the previous number of files, and NCF is the new correlation factor.

$$NCS = \frac{(OCS \times PNF) + NCF}{PNF + 1}$$

II.1.3 Comparing a single file to a fingerprint

When identifying a file using the byte frequency analysis algorithm (BFA):
• Compute a score for each fingerprint identifying how closely the unknown file matches the frequency distribution in the fingerprint. The score is generated by comparing each byte value frequency from the unknown file with the corresponding byte value frequency from the fingerprint. As the difference between these values decreases, the score should increase toward 1. As the difference increases, the score should decrease toward 0.
• Compute an “assurance level” for each fingerprint, indicating how much confidence can be placed on the score. The file type’s byte frequency correlation strengths are used to generate a numeric rating for the assurance level. This is because a file type with a characteristic byte frequency

distribution will have high correlation strengths for many byte values.
• Compare the unknown file’s byte frequency distribution to the byte frequency scores and the associated correlation strengths stored in each file type fingerprint and pick the best match.
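
The BFA steps of Sections II.1.1 to II.1.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `byte_frequencies`, `bell`, `Fingerprint`, and `identify` are hypothetical. The companding step is modeled as x^(1/β), an assumed form, since the companding function itself is not reproduced in this text. The first file added to a fingerprint is assumed to have a correlation factor of 1, the score is assumed to be the mean bell-curve value of the per-byte differences, and the final ranking by score times assurance level is likewise an assumption.

```python
import math

BETA = 1.5      # companding parameter the text reports as optimal
SIGMA = 0.0375  # bell-curve width the text reports as optimal


def byte_frequencies(data, beta=BETA):
    """Count each byte value, normalize by the most frequent one, then compand."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    peak = max(counts)
    if peak == 0:  # empty input: nothing to count
        return [0.0] * 256
    # ASSUMPTION: companding is modeled as x ** (1 / beta); the exact function
    # is not reproduced in this text. It maps 0 -> 0 and 1 -> 1.
    return [(c / peak) ** (1.0 / beta) for c in counts]


def bell(x, sigma=SIGMA):
    """F(x) = exp(-x^2 / (2 * sigma^2)): equals 1 at x = 0 and decays toward 0."""
    return math.exp(-(x * x) / (2.0 * sigma * sigma))


class Fingerprint:
    """Hypothetical container for one file type's BFA fingerprint."""

    def __init__(self):
        self.scores = [0.0] * 256     # average frequency score per byte value
        self.strengths = [0.0] * 256  # correlation strength per byte value
        self.num_files = 0            # number of files added so far

    def add(self, freqs):
        pnf = self.num_files
        for i, nfs in enumerate(freqs):
            # ASSUMPTION: the first file has nothing to be compared against,
            # so its correlation factor is taken to be 1.
            ncf = bell(nfs - self.scores[i]) if pnf else 1.0
            self.strengths[i] = (self.strengths[i] * pnf + ncf) / (pnf + 1)  # NCS
            self.scores[i] = (self.scores[i] * pnf + nfs) / (pnf + 1)        # NFPS
        self.num_files = pnf + 1

    def compare(self, freqs):
        """Return (score, assurance) for an unknown file's distribution."""
        # ASSUMPTION: the score is the mean bell-curve value of the per-byte
        # differences; the text only requires it to rise toward 1 as they shrink.
        score = sum(bell(f - s) for f, s in zip(freqs, self.scores)) / 256
        assurance = sum(self.strengths) / 256  # average correlation strength
        return score, assurance


def identify(data, fingerprints):
    """Pick the best-matching type; weighting score by assurance is an ASSUMPTION."""
    freqs = byte_frequencies(data)

    def weighted(name):
        score, assurance = fingerprints[name].compare(freqs)
        return score * assurance

    return max(fingerprints, key=weighted)
```

A caller would build one `Fingerprint` per known file type by passing `byte_frequencies(sample)` to `add` for each training file, then hand a dictionary of those fingerprints to `identify`. Keeping the score and the assurance level separate in `compare` leaves room to combine BFA with the other two algorithms in a different way.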

Figure II.7 shows the byte frequency distribution for the HTML fingerprint, with the frequency scores shown by a solid line and the correlation strengths shown by a dotted line. Figure II.8 shows the byte frequency distribution scores and correlation strengths for the ZIP fingerprint.

Using this scheme, the HTML file format would have a high assurance level for the byte frequency, since many byte values have high correlation strengths, whereas the ZIP file format would have a low assurance level for the byte frequency, suggesting that perhaps other algorithms should be used to improve accuracy for this type.

II.2 Byte frequency cross-correlation (BFC) algorithm

While BFA algorithm compares overall byte frequency distributions, other characteristics of the frequency distributions are not addressed. One example can be seen in Figure II.7. There are two equal-sized spikes in the solid frequency scores at byte values 60 and 62, which correspond to the ASCII characters “<” and “>” respectively. Since these two characters are used as a matched set to denote HTML tags within the files, they normally occur with nearly identical frequencies.

This relationship, or cross-correlation, between byte value frequencies can be measured and scored as well, strengthening the identification process. This section describes the methods used to build the byte frequency cross-correlations of individual files, to construct a fingerprint representative of the file type, and to compare an unknown file to a file type fingerprint, obtaining a numeric score.

Figure II.8 - Byte frequency distribution with correlation strength for ZIP fingerprint.

II.2.1 Building the byte frequency cross-correlation

There are two key pieces of information that need to be calculated concerning the byte frequency cross-correlation analysis: the average difference in frequency between all byte pairs and a correlation strength similar to the BFA algorithm. Byte value pairs that have very consistent frequency relationships across files, such as byte values 60 and 62 in HTML files, as mentioned above, will have a high correlation strength score. Byte value pairs that have little or no relationship will have a low correlation strength score.

In order to characterize the relationships between byte value frequencies, a two-dimensional 256×256 cross-correlation array is built (byte values are between 0 and 255), with indices ranging from 0 to 255 in each dimension.

Note that if byte value i is being compared to byte value j, then array entry (i, j) contains the frequency difference between byte values i and j while array entry (j, i) contains the negative of the corresponding (i, j) location. Hence, half of the array contains redundant information and storing both of them is unnecessary. We use the lower half of the array to store the correlation strengths of each byte value pair. So now if byte value i is being compared to byte value j, then array entry (i, j) contains the frequency difference between byte values i and j while array entry (j, i) contains the correlation strength for the byte pair. Furthermore, a byte value will always have an average frequency difference of 0 and a correlation strength of 1 with itself, so the main diagonal of the array can be used to store any other information that is needed for the comparisons. We use the first entry of the main diagonal (0, 0) to store the number of files that have been used to compute the fingerprint.

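
The array layout described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. The function names `cross_correlation_array` and `pack_fingerprint` are hypothetical, pairs are assumed to be ordered so that i < j indexes the upper half, and the difference for a pair (i, j) is assumed to be the frequency of j minus the frequency of i, following the subtraction described in this section.

```python
def cross_correlation_array(freqs):
    """Build one file's array of pairwise frequency differences (upper half only)."""
    n = len(freqs)
    array = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # ASSUMPTION: sign convention is frequency of j minus frequency of i
            array[i][j] = freqs[j] - freqs[i]
    return array


def pack_fingerprint(avg_diff, strength, num_files):
    """Pack a fingerprint into a single square array.

    avg_diff[i][j] (i < j): average frequency difference of the byte pair
    strength[i][j] (i < j): correlation strength of the same byte pair
    """
    n = len(avg_diff)
    packed = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            packed[i][j] = avg_diff[i][j]  # upper half: frequency difference
            packed[j][i] = strength[i][j]  # lower half: correlation strength
    packed[0][0] = num_files               # diagonal entry (0, 0): number of files
    return packed
```

Storing both quantities in one 256×256 array halves the fingerprint size compared with two full arrays, which fits the stated design goal of small fingerprint files.
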
Figure II.7 - Byte frequency distribution with correlation strength for HTML fingerprint.

Calculating the difference between the frequencies of two bytes with values i and j involves simply subtracting the frequency score of byte value i from the frequency of byte value j. Since byte value

frequencies were normalized, with a range of 0 to 1, this results in a number with a possible range of –1 to 1. A score of –1 indicates that the frequency of byte value i was much greater than the frequency of byte value j. A score of 1 indicates that the frequency of byte value i was much less than the frequency of byte value j. A score of 0 indicates that there was no difference between the frequencies of the two byte values.

II.2.2 Combining cross-correlations into a fingerprint

Once the frequency differences between all byte-value pairs for an input file have been calculated, the new fingerprint can be calculated using the following equation, similar to the one used in BFA algorithm, where NFPD is the new fingerprint difference, OFPD is the old fingerprint difference, NFD is the new frequency difference, and PNF is the previous number of files.

$$NFPD = \frac{(OFPD \times PNF) + NFD}{PNF + 1}$$

A correlation factor can be calculated for each byte value pair, by comparing the frequency differences in the input file to the frequency differences in the fingerprint. The correlation factors can then be combined with the scores already in the fingerprint to form an updated correlation strength score for each byte value pair. As more files are added to construct the fingerprint, the correlation strengths more accurately reflect the file type.

If at least one file has been previously added into a fingerprint, then the correlation factor for each byte value pair is calculated by subtracting the pair’s frequency difference from the new file and the same pair’s average frequency difference from the fingerprint. This results in a new overall difference between the new file and the fingerprint. If this overall difference is very small, then the correlation strength should increase toward 1. If the difference is large, then the correlation strength should decrease toward 0. New correlation strength is calculated using the same equations as the BFA algorithm.

After the average frequency differences and correlation strengths for each byte value pair of the new input file have been updated in the fingerprint, the Number of Files field is incremented by 1 to indicate the addition of the new file.

It is interesting to compare the frequency distribution graphs of BFA algorithm to the byte frequency cross-correlation plots generated from BFC algorithm. Figure II.7 shows the frequency distribution for the HTML file format, and Figure II.9 shows a graphical plot of the HTML fingerprint cross-correlation array. Note that there are ranges of byte values in the frequency distribution that never occurred in any files (they appear with 0 frequency.) These regions appear in the cross-correlation plot as mid-tone gray regions of 0 frequency difference. Furthermore, the intersection of 60 on the vertical axis with 62 (corresponding to the ASCII values for the “<” and “>”) shows a dark dot representing a correlation strength of 1, as expected.

Looking at a graphical plot of a GIF fingerprint cross-correlation array (not shown to save space), a sawtooth pattern in the frequency distributions is a characteristic feature of the GIF file type, and it manifests in the cross-correlation plot as a subtle grid pattern in the frequency difference region.

II.2.3 Comparing a single file to a fingerprint

When identifying a file using the byte frequency cross-correlation algorithm (BFC):
• Compute a score, similar to BFA, for each fingerprint identifying how closely the unknown file matches the fingerprint. The score is generated by comparing the frequency difference for each byte value pair from the unknown file with the average frequency difference for the corresponding byte value pair from the fingerprint. As the difference between these values decreases, the score should increase toward 1. As the difference increases, the score should decrease toward 0.
• Compute the assurance level, indicating how much confidence can be placed on the score. File types that have characteristic cross-correlation patterns should have high assurance levels, others should have low assurance levels. As with the BFA algorithm, the correlation strengths are used to generate a numeric rating for the assurance level. The higher the assurance level, the more weight can be placed on the score for that fingerprint.
• Compare the unknown file’s cross-correlation array to the cross-correlation scores and correlation strengths stored in each file type fingerprint and pick the best match.

II.3 File header/trailer (FHT) algorithm

BFA and BFC make use of byte value frequencies to characterize and identify file types. While these characteristics can effectively identify many file types, some do not have easily identifiable patterns. To address this, the file headers and file trailers can be analyzed and used to strengthen the recognition of many file types. The file headers and trailers are patterns of bytes that appear in a fixed location at the beginning and end of a file

respectively. These can be used to dramatically increase the recognition ability on file types that do not have strong byte frequency characteristics.

This section describes the methods used to build the header and trailer profiles of individual files, to combine the ratings from multiple files into a fingerprint for the file type, and to compare an unknown file to a file type fingerprint, obtaining a numeric score.

II.3.1 Building the header and trailer profiles

The first step in building header and trailer profiles is to decide how many bytes from the beginning and end of the file will be analyzed. If H is the number of file header bytes to analyze, and T is the number of trailer bytes to analyze, then two two-dimensional arrays are built, one of dimensions H × 256 and the other of dimensions T × 256. For each byte position in the file header (trailer), all 256 byte values can be independently scored based upon the frequency with which the byte value occurs at the corresponding byte position.

An individual file’s header array is initially set to 0. For each byte position in the header, from byte 0 (the first byte in the file) to byte H – 1, the array entry corresponding to the value of the byte is filled with a correlation strength of 1 (each row has 255 zeros and a single one). The only exception occurs when an input file is shorter than the header or trailer lengths. In this case, the fields in the missing byte position rows will be filled with the value -1 to signify no data. (Note that if a file length is greater than the header and trailer lengths, but less than the sum of the two lengths, then the header and trailer regions will overlap.) The trailer array is similarly constructed.

II.3.2 Combining header and trailer profiles into a fingerprint

A fingerprint is constructed by averaging the correlation strength values from each file into the fingerprint using the following equation, which is similar to the ones used in BFA and BFC algorithms, where NFPA is the new fingerprint array entry, OFPA is the old fingerprint array entry, PNF is the previous number of files, and NA is the new array entry.

$$NFPA = \frac{(OFPA \times PNF) + NA}{PNF + 1}$$

A sample graphical plot of the file header fingerprint array is shown in Figure II.11 (please note the very light markings on the figure) for the GIF file type. The first few bytes of the GIF header show high correlation strengths (represented by dark marks), indicating that this type has a strongly characteristic file header. The specification for the GIF format states that the files shall all begin with the text string “GIF87a” for an earlier version of the format, or “GIF89a” for a later version. Further inspection of Figure II.11 shows that rows 0-3 and 5 have correlation strengths of 1 for the byte value positions corresponding to ASCII values of “GIF8” and “a”. In row four, byte values 55 and 57 (ASCII values for “7” and “9”) both show correlation strengths roughly balanced. This indicates that approximately equal numbers of files of each version of the GIF format were loaded into the fingerprint. Beyond byte position six, there is a much broader distribution of byte values, resulting in lower correlation strengths and lighter marks on the plot.

Figure II.9 - Byte frequency cross-correlation plot for the HTML file type

Figure II.12 shows a very similar plot of the file trailer for the MPEG file type fingerprint, where the end of the file is represented by byte position 0 at the bottom of the plot. This plot shows a broad distribution of byte values (resulting in extremely faint marks) up until four bytes from the end of the file. These final four bytes show a characteristic pattern similar to the pattern described above for the GIF file header.

Note that for file types that do not have a characteristic file header or trailer, the corresponding plots would appear essentially empty, with many

scattered dots with very low correlation strengths (therefore producing
almost white dots).

II.3.3 Comparing a single file to a fingerprint

When identifying a file using the file headers and trailers algorithm (FHT):
• Construct the file header and trailer arrays for the unknown file as
  described above.
• Use the following equation to generate the score for the file header and
  trailer, where C is the correlation strength for the byte value extracted
  from the input file for each byte position, and G is the correlation
  strength of the byte value in the fingerprint array for the corresponding
  byte position. This equation produces an average of the correlation
  strengths of each byte value from the input file, weighted by the greatest
  correlation strength at each byte position. This results in placing
  greatest weight on those byte positions with a strong correlation,
  indicating that they are part of a characteristic file header or trailer,
  and placing much less weight (ideally no weight) on values where the file
  type does not have consistent values.

  $$S = \frac{C_1 G_1 + C_2 G_2 + \dots + C_n G_n}{G_1 + G_2 + \dots + G_n}$$

• The assurance level for the file header and file trailer is simply set
  equal to the overall maximum correlation strength in the header and
  trailer arrays, respectively. This is different from the approach used in
  the BFA and BFC algorithms, where the average of all correlation strengths
  was used.
• Compare the unknown file’s header/trailer information to the
  cross-correlation scores and correlation strengths stored in each file
  type fingerprint and pick the best match.

The GIF file header provides a clear example. The first four byte positions
each have a correlation strength of 1 for a single byte. This indicates that
all input files of the GIF file type had the same byte values for these
positions. If an unknown file has different bytes in these positions, it is a
very strong indicator that it is not a GIF file. On the other hand, if the
unknown file has a differing byte value at position 20, which shows a very
low correlation strength, this offers no real information about whether the
unknown file is a GIF file or not, since there are no bytes in this position
with a high correlation strength.

Setting the assurance level equal to the maximum correlation strength allows
even a few bytes with very high correlation strength, such as those in the
GIF file format, to provide a strong indication of file type. Therefore even
a few bytes can produce a strong influence in recognition. On the other hand,
if a file type has no consistent file header or trailer, the maximum
correlation strength will be very low. This means little weight will be
placed on the header or trailer with the low assurance level.

Figure II.11 - File header plot for the GIF file fingerprint

The optimal header (trailer) length is the value that results in the highest
average level of differentiation across all file types. Our experimental
results indicate that the optimum header and trailer length for file type
identification is five bytes [1].

III. Experimental results

In this section we describe our experimental results, using each of the 3
above-mentioned algorithms to identify file types. Thirty file type
fingerprints are constructed and used for this test. To run the accuracy
test, four test files are selected for each file type, resulting in a total
library of 120 files. Combining the file types ACD, DOC, PPT, and XLS into a
single OLE DOC fingerprint, using the average of the four type fingerprints,
resulted in more accurate type recognition for the BFC and FHT algorithms and
a slight decrease for BFA. The following shows the accuracy test results for
each of the 3 algorithms. Type recognition reports were generated for each of
the 120 test files:
• Figure III.1 shows the resulting file type identification grid for the BFA
  algorithm. BFA’s accuracy is only 27.50%. This is better than purely
  random guesses but not accurate enough for practical use. We should note
  that the accuracy of this algorithm increases to 29.17% if separate


  fingerprints are used for ACD, DOC, PPT, and XLS files, which is not a
  significant improvement.
• Figure III.2 shows the resulting file type identification grid for the BFC
  algorithm. BFC’s accuracy is only 45.83%. This is a significant
  improvement over BFA, but not accurate enough for practical use.
• Figure III.3 shows the resulting file type identification grid for the FHT
  algorithm. FHT’s accuracy is 95.83%. This is a significant improvement
  over BFA and BFC and may be accurate enough for some fault-tolerant
  applications. We should note that using separate fingerprints for ACD,
  DOC, PPT, and XLS files decreases FHT’s accuracy to 85%; most of the
  errors occurred in distinguishing among the ACD, DOC, PPT, and XLS file
  types.

Type  File 1  File 2  File 3  File 4  Score
3TF   3TF     3TF     3TF     3TF     4
ACD   3TF     3TF     OLE     OLE     2
AVI   3TF     CRP     RM      3TF     0
BMP   3TF     3TF     FNT     3TF     0
CAT   CAT     CAT     CAT     CAT     4
CRP   CRP     CRP     CRP     CRP     4
DOC   WPD     3TF     3TF     3TF     0
EXE   FNT     3TF     3TF     CRP     0
FNT   3TF     3TF     3TF     GIF     0
GIF   RM      ZIP     RM      RM      0
GZ    MP3     TAR     ZIP     CRP     0
HTML  RTF     TXT     CAT     CAT     0
JPG   JPG     GZ      MP3     MP3     1
MDL   CAT     CAT     CAT     CAT     0
MOV   CRP     CRP     RM      RM      0
MP3   MP3     GZ      MP3     MP3     3
MPEG  CRP     CRP     MP3     CRP     0
PDF   CRP     PDF     EXE     TXT     1
PPT   3TF     3TF     3TF     3TF     0
PS    TXT     TXT     CAT     TXT     0
RTF   RTF     RTF     RTF     CAT     3
RM    RM      CRP     RM      CRP     2
RPM   GZ      CRP     GZ      GZ      0
TAR   CRP     CAT     TXT     ZIP     0
TXT   TXT     CAT     TXT     TXT     3
TTF   TTF     TTF     TTF     WPD     3
WAV   CAT     TXT     FNT     3TF     0
WPD   3TF     3TF     WPD     TXT     1
XLS   WPD     FNT     3TF     3TF     0
ZIP   GIF     ZIP     ZIP     GIF     2
TOTAL CORRECT: 33
TOTAL FILES: 120
Accuracy: 27.50%

Figure III.1 Identified type of each test file with a combined OLE DOC
fingerprint, using BFA algorithm.

Type  File 1  File 2  File 3  File 4  Score
3TF   3TF     3TF     3TF     3TF     4
ACD   OLE     OLE     OLE     OLE     4
AVI   OLE     CAT     OLE     3TF     0
BMP   3TF     3TF     TTF     3TF     0
CAT   CAT     CAT     CAT     CAT     4
CRP   CRP     CRP     CRP     CRP     4
DOC   OLE     OLE     OLE     OLE     4
EXE   OLE     OLE     OLE     OLE     0
FNT   3TF     3TF     3TF     3TF     0
GIF   GIF     GIF     3TF     3TF     2
GZ    3TF     3TF     3TF     3TF     0
HTML  RTF     TXT     CAT     CAT     0
JPG   3TF     3TF     3TF     3TF     0
MDL   MDL     CAT     MDL     MDL     3
MOV   GIF     GIF     3TF     3TF     0
MP3   3TF     MP3     MP3     MP3     3
MPEG  3TF     MPEG    3TF     OLE     1
PDF   PDF     PDF     PDF     TXT     3
PPT   OLE     OLE     OLE     OLE     4
PS    TXT     TXT     CAT     TXT     0
RTF   RTF     TXT     RTF     TXT     2
RM    OLE     RM      RM      RM      3
RPM   RPM     OLE     RPM     RPM     3
TAR   OLE     CAT     3TF     RPM     0
TXT   TXT     CAT     TXT     TXT     3
TTF   TTF     TTF     OLE     OLE     2
WAV   TXT     TXT     TXT     TXT     0
WPD   WPD     WPD     WPD     TXT     3
XLS   3TF     OLE     OLE     OLE     3
ZIP   3TF     3TF     3TF     3TF     0
TOTAL CORRECT: 55
TOTAL FILES: 120
Accuracy: 45.83%

Figure III.2 Identified type of each test file with a combined OLE DOC
fingerprint, using BFC algorithm.

IV. Conclusions and future work

The BFA algorithm proved to be the fastest of the three algorithms. An
unknown file takes an average of 0.010 seconds to compare to 25 fingerprints
and identify the closest match. (All times were taken on an 800 MHz Pentium
III laptop with 512 MB RAM.) Because of its poor accuracy, BFA would be of
very limited use. The calculations performed in this algorithm, though, are
used as the basis for the BFC.

The BFC algorithm proved to be by far the slowest and only moderately more
accurate than BFA. An unknown file takes an average of 1.19 seconds to
compare to 25 fingerprints and identify the closest match. This algorithm
offers slightly improved accuracy over BFA but its accuracy is still too low
to be of practical use in most applications.

The FHT algorithm provides the best combination of speed and accuracy. An
unknown file takes an average of 0.015 seconds to compare to 25 fingerprints
and identify the closest match, which is almost as fast as BFA. This
algorithm had by far the highest accuracy at 95.83% for a combined OLE DOC
fingerprint and 85% for separate ACD, DOC, PPT, and XLS fingerprints.

Although FHT performs considerably better than the other algorithms (95.83%
accuracy), there would be a tradeoff in only using this algorithm. Not all
file types have consistent file headers or trailers, and such files would
most likely not be correctly recognized if only FHT were used. BFA and BFC
could help with the identification of the few files FHT was unable to
identify. We are working on developing algorithms that use a combination of
these techniques to improve type identification accuracy.
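Because any refinement of FHT would start from its scoring step, the fingerprint update of Section II.3.2 and the score and assurance level of Section II.3.3 are restated here as a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used for the experiments: C is read as the fingerprint's correlation strength for the byte observed in the unknown file, G as the greatest correlation strength at that byte position, and the handling of -1 "no data" rows for short files is left out. All function names are hypothetical.

```python
from typing import List, Sequence

# One row of 256 correlation strengths per byte position (assumed layout).
Profile = List[List[float]]


def update_fingerprint(fingerprint: Profile, file_profile: Profile,
                       prev_files: int) -> Profile:
    """Running average: NFPA = (OFPA * PNF + NA) / (PNF + 1), entry by entry."""
    return [
        [(old * prev_files + new) / (prev_files + 1)
         for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(fingerprint, file_profile)
    ]


def fht_score(fingerprint: Profile, data: Sequence[int]) -> float:
    """S = sum(C_i * G_i) / sum(G_i) over the byte positions of the fingerprint."""
    numerator = 0.0
    denominator = 0.0
    for row, byte in zip(fingerprint, data):
        c = row[byte]   # strength of the byte value seen in the unknown file
        g = max(row)    # greatest strength at this position (assumed meaning of G)
        numerator += c * g
        denominator += g
    # A fingerprint with no signal at any position yields a score of 0.
    return numerator / denominator if denominator else 0.0


def assurance_level(fingerprint: Profile) -> float:
    """Overall maximum correlation strength in the fingerprint array."""
    return max(max(row) for row in fingerprint)


# Toy two-position header fingerprint: position 0 is always 'G',
# position 1 is 'I' in half of the training files and 'J' in the other half.
toy = [[0.0] * 256 for _ in range(2)]
toy[0][ord("G")] = 1.0
toy[1][ord("I")] = 0.5
toy[1][ord("J")] = 0.5

matching = fht_score(toy, b"GI")      # (1*1 + 0.5*0.5) / (1 + 0.5)
mismatching = fht_score(toy, b"XI")   # (0*1 + 0.5*0.5) / (1 + 0.5)
```

In the toy example a mismatch at the fully consistent position 0 lowers the score far more than the ambiguity at position 1, which is the weighting behaviour the scoring equation is intended to produce. For a trailer fingerprint the bytes would be passed in reverse order, so that position 0 is the end of the file.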


Other improvements could be investigated in the methods used to compute the
score for BFA and BFC. Perhaps more sophisticated curve-matching (or other)
algorithms could be tested to see if they would improve the accuracy of these
options. Improvements could also be made in computing the score and
correlation strength for header and trailer analysis. The header and trailer
tests both showed degradation in performance as longer header and trailer
lengths were used. It should be possible to modify the scoring algorithm to
prevent this degradation.

Overall, the algorithm proved effective at correctly identifying the file
types of files based solely upon the content of the files. The FHT algorithm
identified executable files with 100 percent accuracy. This option could
therefore be of use to virus scanning packages that are configured to only
scan executable files. FHT is extremely fast and, for header and trailer
lengths of five bytes, the total fingerprint size for an executable
fingerprint would be only 53 bytes. The algorithm could possibly be of use to
cryptanalysts as well. It could be used to automatically differentiate
between real data and “random” encrypted traffic.

A number of other systems could also benefit from the described file
recognition approach. These include forensic analysis systems, firewalls
configured to block transfers of certain file types, web browsers, and
security downgrading systems. Further refinements would be required, however,
before the algorithm would be fast enough or accurate enough to be used by an
operating system that must reliably deal with a large number of varied file
types.

Type  File 1  File 2  File 3  File 4  Score
3TF   3TF     3TF     3TF     3TF     4
ACD   OLE     OLE     OLE     OLE     4
AVI   AVI     AVI     AVI     AVI     4
BMP   BMP     BMP     BMP     BMP     4
CAT   CAT     CAT     CAT     CAT     4
CRP   CRP     CRP     CRP     CRP     4
DOC   OLE     OLE     OLE     OLE     4
EXE   EXE     EXE     EXE     EXE     4
FNT   FNT     FNT     FNT     RPM     3
GIF   GIF     GIF     GIF     GIF     4
GZ    GZ      GZ      GZ      GZ      4
HTML  HTML    HTML    HTML    HTML    4
JPG   JPG     JPG     JPG     JPG     4
MDL   MDL     CAT     MDL     MDL     3
MOV   MOV     MOV     MOV     MOV     4
MP3   RM      MP3     MP3     MP3     3
MPEG  MPEG    MPEG    MPEG    MPEG    4
PDF   PDF     PDF     PDF     PDF     4
PPT   OLE     OLE     OLE     OLE     4
PS    PS      PS      PS      PS      4
RTF   RTF     RTF     RTF     RTF     4
RM    RM      RM      RM      RM      4
RPM   RPM     RPM     RPM     RPM     4
TAR   TAR     TAR     TAR     TAR     4
TXT   TXT     TXT     TXT     TXT     4
TTF   TTF     TTF     TTF     TTF     4
WAV   AVI     WAV     WAV     AVI     2
WPD   WPD     WPD     WPD     WPD     4
XLS   OLE     OLE     OLE     OLE     4
ZIP   ZIP     ZIP     ZIP     ZIP     4
TOTAL CORRECT: 115
TOTAL FILES: 120
Accuracy: 95.83%

Figure III.3 Identified type of each test file with a combined OLE DOC
fingerprint, using FHT algorithm.

Bibliography

[1] Mason McDaniel, Automatic File Type Detection Algorithm, Master's
    Thesis, James Madison University, 2001.
[2] Bellamy, John, Digital Telephony, Second Edition, John Wiley & Sons,
    Inc., New York, New York, 1991, pp. 110-119.
[3] The Binary Structure of OLE Compound Documents, available online from:
    http://user.cs.tu-berlin.de/~schwartz/pmh/guide.html
[4] Kyler, Ken, Understanding OLE Documents, Delphi Developer’s Journal,
    September 1998, available online from:
    http://www.kyler.com/pubs/ddj9894.html
[5] Stallings, William, Cryptography and Network Security, Prentice Hall,
    Upper Saddle River, New Jersey, 1999, p. 32.
[6] The Advanced Missile Signature Center Standard File Format, available
    online from: http://fileformat.virtualave.net/archive/saf.zip
[7] To Associate a File Extension with a File Type, Windows 2000
    Professional Documentation, available online from:
    http://www.microsoft.com/WINDOWS2000/en/professional/help/win_fcab_reg_filetype.htm
[8] Why do some scripts start with #!, Chip Rosenthal, available online
    from: http://baserv/uci/kun.nl/unix-faq.html
[9] /etc/magic Help File, available online from:
    http://qdn.qnx.com/support/docs/qnx4/utils/m/magic.html
