File Formats- Characterization and Validation
IFAC-PapersOnLine 49-29 (2016) 253–258
Abstract: Nowadays, most information is stored digitally. Viewed from a high level, digital information
is just an array of bits, and special software is required to interpret it and recover its real meaning.
Therefore, if the evolution of technology means this software can no longer be executed, there is a
risk that the data it interprets also becomes unusable. The goal of Digital Preservation is to prevent
this. Data is commonly stored in files, and each file has a specific format or structure; by knowing it,
a user can work out the real meaning of the raw data stored in the file as an array of bits. Digital
Preservation considers a valid file format a prerequisite for a file to be in usable form, where valid
means that the file is structured in conformance with its declared file format. In this paper we throw a
spotlight on the accuracy and capability of file validation tests. To that end, we present several open
source tools which are able to automatically identify and verify the file format. We focus on the file
types they can identify and on how they perform on large-scale data sets.
© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Keywords: File Format, Digital Preservation, JHove, Droid, Exiftool.
to the fact that software used to interpret different file formats, especially complex file formats,
has some degree of flexibility, so it is able to interpret files which do not strictly comply with the
specification of the file format. However, one thing is guaranteed: if a file is structured strictly in
conformance with its file format specification, the software designed to open it will definitely open
it. If it is not, further analysis should be conducted to see whether the parts missing from the
structure are inside the boundaries of that flexibility, so that the software can still interpret the
file. Once a file format is verified, it is noted as a valid file format. There exist many tools which
can do file format verification; in the coming chapters we show the results of an experiment done with
some of them and discuss their performance.

3. FILE FORMAT REGISTRIES
In order to validate the format of a given file, the software tool which performs this process should
have at its disposal the specification of the exact structure of the given file format, and then analyse
whether the file is built in conformance with it. Therefore, databases have been built which contain
the specifications of different file formats. When a tool wants to validate a file format, it can query
one of these databases to retrieve the specification of a valid file for the format in question, then
compare it with the actual file and make the decision. In addition, file format registries contain
further information which might be used by Digital Preservation but is not in the scope of this paper,
such as the relationship between different file formats or between different versions of the same file
format; this might be useful when a migration between two file formats or two versions of the same file
format is done. Below we present some well-known file format registries.

PRONOM - Created by The National Archives of the United Kingdom, it consists of a database of file
formats in which the characteristics of each file format are noted. In addition, the relationship
between different file formats is also noted. Since it contains information for a large number of file
formats, the information size for each file format is kept as small as possible [5].

GDFR - The Global Digital Format Registry was developed by the Harvard University Digital Library,
which is also one of the pioneers in the field of Digital Preservation. Later on, the Online Computer
Library Center (OCLC) and the US National Archives and Records Administration (NARA) also contributed
to building this format registry [6].

UDFR - The Unified Digital Format Registry is another file format registry, created by the University
of California Digital Library. It aims to group the information of both previous big file format
registries, PRONOM and GDFR, and uses RDF to store its data [7].

4. FILE FORMAT CHARACTERIZATION AND VALIDATION TOOLS
In addition to file format registries, the digital preservation community has created specialized tools
which make use of the data in these registries to extract characteristic properties and validate the
file format of a specific file. Usually, the teams that developed the file format registries have also
developed tools that use them, but there are also tools that make use of data in registries and were
developed independently from them. There are plenty of tools, and most of them are open source, so they
are freely available under a public license. One of the most critical weaknesses of digital
preservation is that there are still no file format validation tools which give satisfactory results
for most of the file formats commonly used in digital libraries. As a result, the tools that exist
today are very specialized in identifying, gathering characteristics of, and verifying some specific
file formats, but give bad results for other file formats. Therefore, in order to validate many
different file formats, one should use multiple tools and then merge their results. In the following
part of this chapter we present the three most popular tools for file format characterization and
verification: JHOVE, DROID and Exiftool. For each tool a brief description is given, mostly covering
who created it and its major specifications.

4.1 JHOVE
JHOVE was developed by JSTOR (http://www.jstor.org/) and Harvard University Libraries. It was first
released in 2004. Like all other file format analysis tools, it is meant to identify the format of a
given file, retrieve its significant properties, and verify the validity of its format. What
distinguishes this tool from others, and is considered its significant characteristic, is its modular
organization [2], which makes it extensible. JHOVE has a module for each file format that it can
identify, characterize, and validate. Some built-in modules have been developed by the JHOVE
developers; this group consists of modules for the following file formats: TIFF, GIF, JPEG, PNG,
JPEG-2000, AIFF, WAV, XML, HTML, UTF8, ASCII, PDF, and a generic bytestream. In addition, there are
other modules developed by users for their own needs; typical examples of this group are modules for
MP3, ZIP, and GZIP. Anyone who wants JHOVE to analyse an additional file format can develop a new
module for that format and integrate it into JHOVE. This makes JHOVE very flexible in terms of the file
formats it can analyse but, on the other hand, it causes a problem: when the same data set is analysed
with different versions of JHOVE, different results are generated [8]. The output of this tool can be
in plain text or in XML format.

4.2 DROID
Digital Record Object Identification (DROID) is another software tool, used only for file format
identification. Even though the topic of this paper is tools which do further analysis on file formats,
such as characterization and validation, and not just identification, we decided to include DROID in
the experiments of this paper because the digital preservation research community considers it one of
the most precise tools for file format analysis, and because it was also used by another paper doing
similar research, whose results we show in chapter six [3]. It was developed by The National Archives
of the United Kingdom, who also developed the file format registry PRONOM explained in chapter three.
DROID is therefore meant to work on top of PRONOM: it uses the information about file formats held in
PRONOM to conduct its analysis of file formats [4]. Communication between DROID and PRONOM is done via
a file format signature file, which is regularly updated from PRONOM to DROID. This signature file
contains the characteristics which DROID uses to identify the file format. Usually these
characteristics consist of file extensions,
256 Lavdërim Shala et al. / IFAC-PapersOnLine 49-29 (2016) 253–258
magic numbers, etc. [2]. DROID comes with a graphical user interface which lets ordinary users with no
programming skills use it easily. In addition, it has a command line interface designed for advanced
users, which lets them use advanced techniques to automate processes inside DROID. The results of an
analysis done with DROID can be output in XML format [9].

4.3 EXIFTOOL
Exiftool is another software tool which does file format identification and, furthermore, is
specialized in retrieving significant properties from file formats. It does not perform file format
validation. The tool is mainly designed for extracting and modifying metadata in the EXIF (Exchangeable
Image File Format) file format, which is specialized to store metadata of digital camera and scanner
output. In addition to EXIF, however, Exiftool works with a huge variety of file formats, including
most of the popular file formats commonly used to store information. Therefore, in the context of
digital preservation it is used to identify and extract significant properties of different file
formats. It provides numerous output formatting options, including tab-delimited, HTML, XML and
JSON [10].

5. EXPERIMENT
In order to find the capabilities of each of the tools presented in the previous chapter, we have
conducted an experiment with 927 files of different file formats, including DOC, GIF, ZIP, HTML, PNG,
PPT, XLS, SWF, etc. These files are taken from a public file library named govdocs1
(http://digitalcorpora.org/corpora/govdoc), from which we took only the first 927 files out of the
1 million which the library contains.
Our experiment consists of two phases. The first phase, presented in this chapter, uses each tool
presented in the previous chapter to conduct an individual analysis of the data set mentioned above.
The second phase tries to combine the results from the individual tools into an overall result. The
goal of this phase is to use the strengths of each individual tool as much as possible and to reduce
the impact of their weaknesses on the result. To achieve this, we have tried different merging
strategies, which lead to different results; these are presented in the next chapter. The following
part of this chapter describes the results of the first phase of the experiment and is organized as
follows: for each tool, the results in terms of file format identification, characterization, and
validation are presented whenever the tool supports any of these analyses.

5.1 Characterization of file format with JHOVE
The table below summarizes the results of the characterization process using JHOVE. For the most
popular file formats, it notes which properties the tool extracted.

Table 1. Characterization of file format with JHOVE
File Format | Properties Extracted | JHOVE module
PNG, DOC, XLS, PPT, TXT, GZ, HTML, ZIP | Last Modified, Size, Format, MIMEtype | bytestream
PDF | Last Modified, Size, Format, MIMEtype, Version, Profile, PDF metadata, Image, Fonts etc. | PDF-hul
JPG | Last Modified, Size, Format, MIMEtype, Version, Profile, JPEG etc. | JPEG-hul
GIF | Last Modified, Size, Format, MIMEtype, Version, Profile, GIF metadata etc. | GIF-hul
XML | Last Modified, Size, Format, MIMEtype, Version, XML… | XML-hul

Identification and Validation of file format with JHOVE: Fig. 2 represents the summary of file format
identification and verification conducted with JHOVE. In total, 927 files have been analysed, and these
files are later analysed using Exiftool and DROID. As noted in the figure, JHOVE managed to identify
the format of each given file. This is because JHOVE assigns the file format "Bytestream" to a file if
it cannot assign any other specific file format. Moreover, unlike other tools, JHOVE also verifies
whether the file structure matches the specification of the file format JHOVE identifies the file to
be. As a result, JHOVE claimed that 89% of all analysed files were structured according to the
specification, so they had a valid file format, while for 11% validation failed. We have manually
analysed the files which JHOVE found not valid, and they were all empty files with only a filename and
extension declared, most of them HTML files.

Fig. 2. Summary of file format identification and verification with JHOVE

5.2 Experiment with DROID
As mentioned in chapter four, DROID is not specialized in file format characterization, thus it is not
able to retrieve significant and special properties from different file formats. But in order to
identify the file format, DROID contains a generic module which extracts some parameters that are
common to all file formats. These parameters are: Extension, Size, LastModified, Format, MIMEType, and
PUID. Therefore, one can say that DROID's characterization capabilities are limited to the properties
mentioned above.

DROID is thought to be one of the most powerful tools when it comes to file format identification. We
have run our experiment on the 927 files to identify their file format with DROID, and the summarized
results are shown in Fig. 3. As can be inferred from the figure, DROID was able to identify the file
format for 96.44% of the files. In addition, more file formats are present compared to the previous
tool, JHOVE.
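
The identification step described for DROID in chapter four, which relies on characteristics such as
magic numbers and file extensions, can be illustrated with a minimal sketch. This is not DROID's
implementation: the signature table is a small hand-written assumption rather than the PRONOM signature
file, the order of the checks (magic number first, extension as fallback) is also an assumption, and the
names MAGIC_NUMBERS, EXTENSIONS and identify are hypothetical.

```python
from pathlib import PurePath

# Hypothetical, hand-written signature table (NOT the PRONOM signature file).
# Each entry maps the leading bytes of a file to a format label.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x1f\x8b", "GZIP"),
]

# Hypothetical extension table, used only when no magic number matches.
EXTENSIONS = {
    ".png": "PNG", ".gif": "GIF", ".pdf": "PDF", ".zip": "ZIP",
    ".jpg": "JPEG", ".jpeg": "JPEG", ".gz": "GZIP", ".html": "HTML",
}


def identify(filename, header):
    """Return (format, method) for a file name and its first bytes.

    method is "signature", "extension" or "unidentified".
    """
    # Assumed order: content-based evidence first.
    for magic, fmt in MAGIC_NUMBERS:
        if header.startswith(magic):
            return fmt, "signature"
    # Fall back to the declared extension.
    ext = PurePath(filename).suffix.lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext], "extension"
    return None, "unidentified"


if __name__ == "__main__":
    print(identify("report.bin", b"%PDF-1.4\n"))
    print(identify("empty.html", b""))
```

The second call mirrors the empty files found in the JHOVE experiment: a file with no content carries
no magic number, so only its declared extension is left as evidence of its format.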
6.1 A strategy to merge results from different tools.
Kulmukhametov and Becker [3] used a tool called C3PO (Clever, Crafty, Content Profiling Tool), which
serves as a front end for different digital preservation tools; the tools used in this paper are among
those supported by C3PO. Besides the intersection of results, the strategy of Kulmukhametov and
Becker [3] also considers the cases shown in Fig. 5, through which they try to take into consideration
the specialized parts of each tool. By using the rules shown in Fig. 5 on their experiment of 100,000
files, they managed to lower the number of conflicts in terms of file format from 15529 (16.41% of
total files) to 3838 (3.98% of total files). In total, they lowered the number of conflicts by roughly
a factor of four.

Rules of merging results, Kulmukhametov and Becker [3].

Fig. 5. Summary of experiment results: Discovered file formats (left), conflict ratio (right)

6.2 Merging results with the strategy of Kulmukhametov and Becker.
We have also used the strategy of Kulmukhametov and Becker to merge the results from chapter five. As
presented in Fig. 5, by considering the intersection of results, 358 out of 927 files are found by each
tool to have the same file format. We then appended to this result the results derived by applying each
rule proposed by Kulmukhametov and Becker [3]; as a result, at the end we have 452 files whose format is
identified either by intersection or by the rules of Kulmukhametov and Becker [3].
Therefore, in our experiment on 927 files, these rules have lowered the number of conflicts from 569
(61.38% of total files) to 475 (51.24% of total files). In our case the strategy has lowered the share
of conflicts by about 10 percentage points but, as presented in the previous section, at larger scale
it gives much better results. The final results of our experiment are shown in Fig. 5.

7. SUMMARY
This paper gives a general evaluation of digital preservation. It informs the reader what digital
preservation is and what it deals with. Furthermore, this paper focuses on file format analysis
conducted as a part of digital preservation. It throws a spotlight on file format identification,
characterization and verification, which are the three analyses that digital preservation conducts on
file formats. First, this paper introduces the reader to the world of file formats by explaining what a
file format is and how file formats work.
Next, it presents some of the most powerful open source tools designed for file format analysis. The
results of each individual tool are presented to the reader. Here we would like to highlight the fact
that, although there are many tools for file format analysis, there is no stable tool which covers a
high number of formats and works with acceptable performance on large-scale data. When working with
data at large scale, each of the tools we used seemed to have performance issues; some of them even
crashed after a short time and could not give results. Therefore, we had to limit our experiment to a
low number of files (927 files). Furthermore, the paper highlights another problem of file format
analysis in digital preservation, which is merging the outputs of different tools to obtain better
overall results. Here we have thrown a spotlight on the non-standardization of tools, as a result of
which merging and combining the results becomes more difficult than it should be. One good thing to
note is that all tools support the XML output format, which makes merging easier. On the other hand,
one big issue encountered here is that tools use different notations for file formats, e.g. some tools
identify the JPEG file format as JPG and others as JPEG; for software to recognise that they refer to
the same format, it should be noted the same way. In addition, case sensitivity is not standardized,
but this issue is solved by using combining tools that are not case sensitive.
Also, the results from different tools differ a lot; as a result, a simple merge of them by finding the
intersection is not a good idea. Therefore, this paper presents the work of another paper in which a
clever strategy to merge and combine the output of different tools is presented. Finally, this paper
uses this strategy to merge the outputs of three tools, JHOVE, DROID, and Exiftool, from tests
conducted on the same data set, and presents the results of this merge, highlighting the impact of the
merging strategy. In summary, by evaluating different tools and strategies for merging their results,
this paper presents the current situation in file format analysis in digital preservation and
highlights the need for new, more standardized and more stable tools to be developed.

REFERENCES
[1] Abrams, S., et al. (2009). "What? So What": The Next-Generation JHOVE2 Arch. for Format-Aware
Charac. Journal IJDC 4(3), pp. 123-136. Edinburgh, UK.
[2] Ford, K.M. (2011). The Application of File Identification, Validation, and Characterization Tools
in Digital Curation. Univ. of Illinois, USA.
[3] Kulmukhametov, A., Becker, C. (2014). Content profiling for preservation: Improving scale, depth
and quality. The EDL-RP 8839 (1) pp. 1–11., Thailand.
[4] Lechich, R. (2014). File format identification and validation tools.
(http://www.library.yale.edu/iac/DPC/FileIDandValidate.pdf) [last accessed 14.06.’14].
[5] Brown, A. (2005). Pronom 4 information model. Technical report, The National Archives, UK.
[6] Goethals, A. (2010). The unified digital formats registry. ISQ 22(2) pp. 26–29.
[7] Frisch, P., Heino, N., Tramp, S. (2012). Unified digital format registry (udfr). Univ. of
California, USA.
[8] Abrams, S. (2004). The role of format in digital preservation. VINE 34(2) pp. 49–55. Emerald…, USA.
[9] Brown, A. (2005). The droid application programming interface. The National Archives, UK.
[10] Raymond, R. (2016). Exiftool documentation. http://www.sno.phy.queensu.ca/~phil/exiftool online,
[last accessed on 14.06.2016].
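
The intersection step of Section 6.2 and the notation problem raised in the summary (JPG versus JPEG,
inconsistent case) can be sketched as follows. This is a minimal sketch under stated assumptions, not
the C3PO implementation: the additional merging rules of [3] are not reproduced, the alias table is
hypothetical, every file on which the tools do not all agree (including files no tool identified) is
counted as a conflict, and the names ALIASES, normalize, merge_by_intersection and conflict_ratio are
introduced here for illustration only.

```python
# Hypothetical alias table mapping alternative notations to one label.
ALIASES = {"JPG": "JPEG", "HTM": "HTML", "TIF": "TIFF"}


def normalize(label):
    """Make labels comparable: ignore case and map known aliases."""
    if label is None:
        return None
    key = label.strip().upper()
    return ALIASES.get(key, key)


def merge_by_intersection(results):
    """Split files into those all tools agree on and the rest.

    results maps a tool name to a dict {filename: label or None}.
    Returns (agreed, conflicts): agreed maps filename to the shared
    label, conflicts is a sorted list of the remaining filenames.
    """
    files = set().union(*(r.keys() for r in results.values()))
    agreed, conflicts = {}, []
    for name in sorted(files):
        labels = {normalize(r.get(name)) for r in results.values()}
        if len(labels) == 1 and None not in labels:
            agreed[name] = labels.pop()
        else:
            # Assumption: disagreement or no identification is a conflict.
            conflicts.append(name)
    return agreed, conflicts


def conflict_ratio(conflicts, total):
    """Share of conflicting files in percent, rounded to two decimals."""
    return round(100.0 * conflicts / total, 2)


if __name__ == "__main__":
    sample = {
        "jhove": {"a.jpg": "JPEG", "b.gif": "GIF"},
        "droid": {"a.jpg": "jpg", "b.gif": "PNG"},
        "exiftool": {"a.jpg": "Jpeg", "b.gif": "GIF"},
    }
    print(merge_by_intersection(sample))
    # Figures reported in Section 6.2: 569 and 475 conflicts out of 927.
    print(conflict_ratio(569, 927), conflict_ratio(475, 927))
```

With the counts reported in Section 6.2, conflict_ratio reproduces the stated shares of 61.38% and
51.24%, a difference of about 10 percentage points.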