Available online at www.sciencedirect.com

ScienceDirect
IFAC-PapersOnLine 49-29 (2016) 253–258

File Formats - Characterization and Validation

Lavdërim Shala *; Ahmet Shala**
*University of Freiburg / Technical Faculty-Computer Science, Freiburg, DE 79110 Germany
e-mail: [email protected] .
**University of Prishtina / Faculty of Mechanical Engineering, Prishtina, XXK 10000 Kosovo
e-mail: [email protected]

Abstract: Nowadays, most information is stored digitally. Seen from a high level, digital information is just an array of bits, and special software is required to interpret it and reveal its real meaning. Therefore, if technological evolution makes this software impossible to execute, there is a risk that the data it interprets also becomes unusable. The goal of Digital Preservation is to prevent this from happening. Data is commonly stored in files, and each file has a specific format or structure; by knowing it, a user can work out the real meaning of the raw data stored in the file as an array of bits. Digital Preservation considers a valid file format a prerequisite for a file to be in usable form, where valid means that the file is structured in conformance with its declared file format. In this paper we throw a spotlight on the accuracy and capability of file validation tests. To that end, we present some open source software tools that are able to automatically identify and verify the file format. We focus on the file types they can identify and on how they work on large-scale data sets.
© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Keywords: File Format, Digital Preservation, JHove, Droid, Exiftool.

1. INTRODUCTION

The evolution of computer systems in the last two decades, and especially their wide usage all over the world by people of every profile, has changed the way people store information. At the time of writing (2016), people prefer to store their information digitally on their electronic devices [1]. In electronic storage the information is saved in objects called files, where each file consists of an array of bits (0 and/or 1). This holds independently of the real meaning of the information: whether it is text, photo, audio or something else, it is stored digitally as an array of bits. Computer software makes these arrays of bits meaningful by converting them to their real interpretation and vice versa. Without this software the information is meaningless; it is just an array of bits.

One important characteristic of the world of computers is the continuous evolution and the wide range of computer software and hardware. Currently, a large number of operating systems and software packages for manipulating different kinds of information (e.g. text) are available to everybody. This number is increasing rapidly and, moreover, current software is regularly updated. Hardware implementations and architectures are also changing towards better performance. In summary, this variety and evolution of computer systems leads to multiple ways of storing and interpreting digital data. It also leads to cases like the one mentioned above, where the user still has the information but is unable to figure out its real meaning. This can happen because the way the information is structured is not supported by the user's current software and hardware, which can be an updated successor of the one used to create the data or a completely new one. Here Digital Preservation comes into consideration. Its main goal is to keep digital information usable over time; it therefore tries to make the information neutral against continuous technological evolution. Digital Preservation tries to find methods, strategies, and activities that help it achieve this goal [1].

In order to view the content of a piece of digital information (a file), the user should know which software to use to open the file. In turn, in order to show the user the real meaning of the file, the software should know how the array of bits inside the file is structured, that is, where in the array the specific information needed by the software is located. Both issues are solved by a concept called file format. A file format defines how information is ordered in the array of bits stored on the digital storage (disk). Normally, in operating systems the file format is declared as an extension at the end of the file name, preceded by a dot. Therefore, identifying the file format seems to be an easy job. However, it is not as easy as it looks, since the extension can be modified at any time by the user. Therefore, from the software point of view, it is important not only to figure out what format the file has by looking at the extension, but also that the file itself represents a valid structure of the format it is thought to be; otherwise, the file is considered to be not useful [2].

Since the goal of Digital Preservation is to keep files in a useful form, it also treats the problem mentioned above by providing mechanisms which identify and verify the format of a file. Once a file is verified by such a mechanism, it is guaranteed to be structured in such a way that software dedicated to its format can open it. In this paper we shed light on the way file formats are treated by Digital Preservation.

This paper is organized as follows: In the second section we briefly explain the analysis that Digital Preservation
2405-8963 © 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2016.11.062

conducts on a file format in order to prove that it is usable. Next, in the third section we introduce file format registries, which are public databases where file format specifications are stored. In the fourth section we briefly describe three tools named JHOVE, DROID, and Exiftool, which use the information in these databases to identify, retrieve characteristics of, and validate the format of files whose format specification is available in the databases. In sections five and six we present our experiment, run against a data set of 927 files using the tools mentioned above. Section five presents the individual results of each tool for the analysis of a large-scale data set. In the sixth section we try to merge the results of the individual tools from the previous section, and we throw a spotlight on the conflicting results from different tools. Moreover, in that section we present a framework proposed by K., B. [3] for merging results from different tools, including the tools presented in section four; this framework tries to resolve conflicts as well as possible. Finally, in the last section we briefly discuss the current situation in file format identification, characterization, and verification, and give suggestions on what is important in the future.

2. DIGITAL PRESERVATION APPROACH TOWARDS FILE FORMAT

All the work done by digital preservation to ensure that a file is useful in terms of its format can be grouped into three processes [1]:
* Identification - The process of identifying what format the given file is likely to be.
* Characterization - The process of extracting specific information from the file.
* Verification - The process of verifying whether the file matches the structure given by the specification.

2.1 Identification

Identification, as described above, has to do with determining what format a file is likely to have. This can be in the interest of a human or of software, and there are two approaches to achieve it. The first one relies on the fact that when a file is stored in a file system, its name also notes the format as an abbreviation called the extension. More precisely, the extension is the text after the last dot ('.') [4]. Therefore, a file named foo.txt in the file system is declared to have the file format text. The problem with this approach is that extensions are not verified by the system by default. As a result, a user can store a file to disk and nobody stops them from declaring any file format for this file. Therefore, an extension declared in the file system does not guarantee that the content of the file matches the structure of the declared file format: a file named foo.txt is not guaranteed to have text stored inside, although it is likely to. To solve this issue, a second approach to identification is used. This more advanced method ignores the declared file format and tries to determine the format by analysing the structure of the file and checking whether it matches any of the known file format specifications [1]. These known specifications are stored in so-called file format registries, which are databases that contain structural specifications for different file formats; they are discussed later in this paper. In order to analyse the structure, some information needs to be extracted from the file; this is known as characterization and is described below in this section [2].

2.2 Characterization

Characterization is the process of extracting specific characteristics from the file. The extracted information is used for two main purposes. First, it is used by software to check whether the file contains the information needed to match the file format it is thought to be; this is called file format validation and is explained later in this section. Second, once validation has succeeded, the extracted data is used to construct the meaningful interpretation of the file and present it to the user [1]. Extracting specific information from files is computationally expensive, since each file has to be loaded in memory and analysed. Therefore, digital libraries where huge amounts of data are stored usually do this extraction only once and then save the extracted data apart from the original file but in the same repository. When another extraction is needed, the saved data is read instead of running another, more computationally expensive extraction from the original file \cite{preservation Thesis}. Ford [2] explains characterization by an example, which we also show below. He takes a well-known image format named TIFF, which is widely used to store image files. The extension for TIFF files is .tif; Fig. 1 shows the raw representation of part of a random TIFF file, where the array of bits is noted in hexadecimal form.

Fig. 1. Hexadecimal Values of a TIFF file [2]

The interesting data to be extracted from the file is marked in red in the figure above. In order to show the image, TIFF viewer software first validates whether the file it is asked to open is structured in conformance with the TIFF file format. In this case the TIFF format specification notes that the first two bytes of a TIFF file indicate the endianness and each have the value 49 in the case of little endian or 4d in the case of big endian. Furthermore, the specification says that in the little-endian case the third byte should have the value 2a and the fourth one 00; for big endian the order of these two bytes is reversed. The first four bytes of a TIFF file together make up what is called the magic number. By checking whether the values of the magic number and their positions conform to the specification, the software determines the file format. In addition, some more data is marked red in the figure; from the information it carries (image height, image width, and colour model) one can conclude that this data is used by the software to interpret the true meaning of the file for the user [2].

2.3 Verification

Verification is the third step that Digital Preservation conducts on a file format. This step tries to figure out whether a given file is structured in compliance with its file format specification, and whether software designed to open this file format will be able to open it successfully, where successfully means interpreting the real meaning of the file. A strict view would be that the two goals of verification are mutually inclusive, so that if one goal is achieved the other is achieved as well. In practice this assumption does not hold, which creates some trouble for verification. This is due

to the fact that the software used to interpret file formats, especially complex ones, has some degree of flexibility, so it is able to interpret files which do not strictly comply with the file format specification. One thing is guaranteed, however: if the file is structured strictly in conformance with its file format specification, the software designed to open it will open it. If it is not, further analysis should be conducted to see whether the deviations from the structure are within the boundaries of that flexibility, so that the software can still interpret the file. Once a file format is verified, it is noted as a valid file format. Many tools can perform file format verification; in the coming sections we show the results of an experiment done with some of them and discuss their performance.

3. FILE FORMAT REGISTRIES

In order to validate the format of a given file, the software tool that does this should have at its disposal the specification of the exact structure of the given file format, and then analyse whether the file is built in conformance with it. For this purpose, databases have been built which contain the specifications of different file formats. When a tool wants to validate a file format, it can query one of these databases to retrieve the specification of a valid file for the format in question, compare it with the actual file, and make the decision. In addition, file format registries contain information which might be used by Digital Preservation but is not in the scope of this paper, such as the relationship between different file formats or between different versions of the same file format. This might be useful when a migration between two file formats, or between two versions of the same file format, is done. Below we present some well-known file format registries.

PRONOM - Created by The National Archives of the United Kingdom, it consists of a database of file formats in which the characteristics of each file format are noted. In addition, the relationships between different file formats are also noted. Since it contains information for a large number of file formats, the information for each file format is kept as brief as possible [5].

GDFR - The Global Digital Format Registry was developed by the University of Harvard Digital Library, which is also one of the pioneers in the field of Digital Preservation. Later, the Online Computer Library Centre (OCLC) and the US National Archives and Records Administration (NARA) also contributed to building this format registry [6].

UDFR - The Unified Digital Format Registry is another file format registry, created by the University of California Digital Library. It aims to group the information of both previous big registries, PRONOM and GDFR, and uses RDF to store its data [7].

4. FILE FORMAT CHARACTERIZATION AND VALIDATION TOOLS

In addition to file format registries, the digital preservation community has created specialized tools which make use of the data in these registries to extract characteristic properties and validate the file format of a specific file. Usually, the teams that developed the file format registries have also developed tools that use them, but there are also tools that make use of data in the registries and were developed independently. There are plenty of tools, and most of them are open source, so they are freely available under a public license. One of the most critical weaknesses of digital preservation is that there are still no file format validation tools which give satisfactory results for most of the file formats commonly used in digital libraries. As a result, the tools that exist today are very specialized in identifying, gathering characteristics of, and verifying some specific file formats, but give bad results for other file formats. Therefore, in order to validate many different file formats, one should use multiple tools and then merge their results. In the following part of this section we present the three most popular tools for file format characterization and verification: JHOVE, DROID and Exiftool. For each tool we give a brief description, mostly who created it and its major specifications.

4.1 JHOVE

JHOVE was developed by JSTOR (http://www.jstor.org/) and the Harvard University Libraries. It was first released in 2004. Like all other file format analysis tools, it is meant to identify, retrieve significant properties of, and verify the validity of the file format of a given file. What distinguishes this tool from others, and is considered its significant characteristic, is that it is organized in a modular way [2]. It is therefore meant to be extensible: JHOVE has a module for each file format that it can identify, characterize, and validate. Some built-in modules have been developed by the JHOVE developers. This group consists of modules for the following file formats: TIFF, GIF, JPEG, PNG, JPEG-2000, AIFF, WAV, XML, HTML, UTF8, ASCII, PDF, and a generic bytestream. In addition, there are other modules developed by users for their own needs; typical examples are modules for MP3, ZIP, and GZIP. Anyone who wants JHOVE to analyse an additional file format can develop a new module for that format and integrate it into JHOVE. This makes JHOVE very flexible in terms of the file formats it can analyse, but it also causes a problem: when the same data set is analysed with different versions of JHOVE, different results are generated [8]. The output of this tool can be in plain text or in XML format.

4.2 DROID

DROID (Digital Record Object Identification) is another software tool, used only for file format identification. Even though the topic of this paper is tools which analyse the file format further, through characterization and validation and not just identification, we decided to include DROID in the experiments of this paper. The digital preservation research community considers it one of the most precise tools for file format analysis, and it was also used in another paper which does research similar to ours and whose results we show in section six [3]. It was developed by The National Archives of the United Kingdom, who also developed the file format registry PRONOM explained in section three. DROID is therefore meant to work on top of PRONOM: it uses the information about file formats held in PRONOM to conduct different analyses of the file format [4]. Communication between DROID and PRONOM is done via a file format signature file, which is regularly updated from PRONOM to DROID. This signature file contains the characteristics which DROID uses to identify the file format. Usually these characteristics consist of file extensions,

magic numbers, etc. [2]. DROID comes with a graphical user interface which lets ordinary users with no programming skills use it easily. In addition, it has a command line interface which is designed for advanced users and lets them use advanced techniques to automate processes inside DROID. The results of an analysis done with DROID can be output in XML format [9].

4.3 EXIFTOOL

Exiftool is another software tool which does file format identification and, furthermore, is specialized in retrieving significant properties from file formats. It does not perform file format validation. The tool is mainly designed for extracting and modifying metadata in the EXIF (Exchangeable Image File Format) file format, which is specialized for storing metadata of digital camera and scanner output. In addition to EXIF, however, Exiftool also works with a huge variety of file formats, including most of the popular formats that are commonly used to store information. In the context of digital preservation it is therefore used to identify different file formats and extract their significant properties. It provides numerous output formatting options, including tab-delimited, HTML, XML and JSON [10].

5. EXPERIMENT

In order to find the capabilities of each of the tools presented in the previous section, we conducted an experiment with 927 files of different file formats including DOC, GIF, ZIP, HTML, PNG, PPT, XLS, SWF etc. These files are taken from a public file library named govdocs1 (http://digitalcorpora.org/corpora/govdoc), from which we took only the first 927 of the 1 million files the library contains. Our experiment consists of two phases. The first phase, presented in this section, uses each tool presented in the previous section to conduct an individual analysis of the data set mentioned above. The second phase tries to combine the results of the individual tools into an overall result. The goal of this phase is to use the strengths of each individual tool as much as possible and to reduce the impact of their weaknesses on the result. To achieve this we tried different merging strategies, which lead to different results that are presented in the next section. The remainder of this section describes the results of the first phase of the experiment and is organized as follows: for each tool, the results in terms of file format identification, characterization, and validation are presented whenever the tool supports the analysis in question.

5.1 Characterization of file format with JHOVE

The table below summarizes the results of the characterization process using JHOVE. For the most popular file formats, the table notes which properties the tool extracted.

Table 1. Characterization of file format with JHOVE

File format | Properties extracted | JHOVE module
------------|----------------------|-------------
PNG, DOC, XLS, PPT, TXT, GZ, HTML, ZIP | Last Modified, Size, Format, MIMEtype | bytestream
PDF | Last Modified, Size, Format, MIMEtype, Version, Profile, PDF metadata, Image, Fonts etc. | PDF-hul
JPG | Last Modified, Size, Format, MIMEtype, Version, Profile, JPEG etc. | JPEG-hul
GIF | Last Modified, Size, Format, MIMEtype, Version, Profile, GIF metadata etc. | GIF-hul
XML | Last Modified, Size, Format, MIMEtype, Version, XML… | XML-hul

Identification and Validation of file format with JHOVE: Fig. 2 presents the summary of file format identification and verification conducted with JHOVE. In total 927 files were analysed; the same files are later analysed using Exiftool and DROID. As noted in the figure, JHOVE managed to identify the format of each given file. This is because JHOVE assigns the file format "Bytestream" to any file to which it cannot assign a more specific file format. Moreover, unlike the other tools, JHOVE also verifies whether the file structure matches the specification of the file format it identifies the file to be. Of all analysed files, JHOVE claimed that 89% were structured according to the specification and therefore had a valid file format, while for 11% validation failed. We manually analysed the files which JHOVE found not valid: they were all empty files with only a file name and extension declared, most of them HTML files.

Fig. 2. Summary of file format identification and verification with JHOVE

5.2 Experiment with DROID

As mentioned in section four, DROID is not specialized in file format characterization, so it is not able to retrieve significant and format-specific properties from different file formats. In order to identify the file format, however, DROID contains a generic module which extracts some parameters that are common to all file formats. These parameters are: Extension, Size, LastModified, Format, MIMEType, and PUID. One can therefore say that DROID's characterization capabilities are limited to the properties mentioned above.

DROID is thought to be one of the most powerful tools when it comes to file format identification. We ran our experiment on the 927 files to identify their file format with DROID; the summarized results are shown in Fig. 3 below. As can be seen from the figure, DROID was able to identify the file format for 96.44% of the files. In addition, more file formats are present compared to the previous tool, JHOVE.

Fig. 3. Summary of file format identification with DROID
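
The two identification approaches of Section 2.1, reading the declared extension and matching a magic number, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the signature table is a small hand-written dictionary rather than a PRONOM signature file, and the function names (format_from_extension, format_from_signature, identify) are hypothetical and do not belong to DROID or any other tool discussed here.

```python
from pathlib import Path

# Hypothetical, hand-written signature table (an assumption for illustration).
# A tool such as DROID reads its signatures from a PRONOM signature file instead.
SIGNATURES = {
    b"II*\x00": "TIFF",            # 49 49 2a 00, little endian
    b"MM\x00*": "TIFF",            # 4d 4d 00 2a, big endian
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
}


def format_from_extension(name):
    """Return the declared format: the text after the last dot, upper-cased."""
    suffix = Path(name).suffix  # empty string when the name has no extension
    return suffix[1:].upper() if suffix else None


def format_from_signature(data):
    """Return the format whose magic number starts the byte string, if any."""
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return None


def identify(name, data):
    """Return (declared, detected) so that a caller can spot disagreements."""
    return format_from_extension(name), format_from_signature(data)


# A GIF stream stored under a .txt name: the extension alone is misleading.
print(identify("foo.txt", b"GIF89a" + bytes(10)))
```

The sketch returns both answers instead of choosing one, since a disagreement between the declared and the detected format is exactly the situation that the second identification approach is meant to reveal.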


5.3 Experiment with Exiftool 6. MERGING OUTPUTS OF DIFFERENT TOOLS

Exiftool is found to be the most specialized tool among three In the previous chapter we presented individual results
tools presented in this paper when it comes to file format from each tool. If one looks at these individual results, and
characterization. This stands so because Exfitool consists of compare them he will find that there are quite big
specialized modules to extract metadata from different file differences between them. First of all some tools like
formats and metadata formats. From the name one can JOHVE are able to note a file format per each file, while
conclude that it is specialized to deal with EXIF metadata, others for a number of files cannot note the file format. On
but, in addition, it is powerful also dealing with other formats the other hand, other tools can identify more file formats,
of metadata like IPTC (International Press as a result, combining of results from each tool will
Telecommunications Council) Information Interchange definitely lead to more accurate results. Based on this fact
Model, XMP (eXtensible Metadata Platform), and ICC on the second phase of my experiment we have merged
Profile (International Colour Consortium). The table below the outputs of JHOVE, DROID, and EXIFTOOL by
summarizes the results of characterization process using using different merging techniques which lead me to an
Exfitool. In the table below for most popular file formats it is overall result which is later presented in this chapter.
noted which properties the tool extracted. Bellow we will chronologically explain how the
Exiftool is considered to be very powerful with file format experiment has been done and what Problems we have
characterization, on the other hand, it deals very good occurred during the experiment. First of all one important
also with file format identification, we have conducted my condition which determines if one is able to combine the
experiment towards 927 files to identify their file format result is that the results should be in the same format, and,
with Exiftool and the summarized results of this experiment in addition their syntax should be the same. In my case
have been shown in Fig. 4 below. each of the tools described in chapter four supported
multiple formats of output. Therefore, we was careful to
Table 2. Summary of file format characterization-Exiftool
use the same output format for all of them, as a result, we
File
Format
Properties extracted have used the XML output format for each tool, and then
PNG FileName, Directory, FileSize, FileModifyDate, we have imported these XML results to a MySQL
JPG MIMEType, Width, Height, ColorType, database from where w e have done the analysis on data.
GIF BitDepth, Compression, Megapixels. What w e saved from each tool in MySQL is a key value
DOC FileName, Directory, FileSize, FileModifyDate, table where key is the file name and value the file format.
XLS MIMEType, CodePage, Title, Subject, Author.
PPT Percentage of conflicts when three tools are considered is
FileName, Directory, FileSize, FileModifyDate,
HTML
MIMEType, Generator, Keywords. 61% and when two tools are considered is 19%. After
TXT Does not recognize this file type!
having output in same format, as noted above there w e
FileName, Directory, FileSize, FileModifyDate,
have encountered another problem, it is the problem of
GZ different notations used by different tools e.g. some tools
MIMEType, Compression
ZIP
Flags, OperatingSystem, ArchivedFileName. noted file formats with lower case letters some others with
PDF
FileName, Directory, FileSize, FileModifyDate, MIMEType, PDFVersion, PageLayout.
XML
FileName, Directory, FileSize, FileModifyDate, MIMEType, Width, XML Metadata.

Fig. 4. Summary of file format identification with Exiftool

As can be seen from the figure, Exiftool was able to identify the file format for 80.6% of the files. Therefore, the number of files whose file format could not be identified is higher for Exiftool than for the previous tool, DROID. However, Exiftool is able to identify a higher number of file formats than JHOVE, while compared to DROID the number of formats identified is almost the same, differing by only one type.

upper case, or DROID noted the JPEG file format as JPG while others noted it as JPEG; the same problem was noted with the GZIP file format. We first had to solve these issues before merging the data. After solving the issues mentioned above, we continued with the actual merging.
In order to merge outputs, a rule must be defined for how the merging is done. First, we did a simple merge in which we took only the intersection of the results from the three tools. This strategy did not prove to be a good one, since there were too many conflicts between the evaluations of file formats and only a few of them were present in the intersection. In total, 358 out of 927 files were found to have the same file format by every tool and took part in the merged result.
Another way to merge the results is to consider all files for which at least two of the three tools reported the same file format. We also applied this merging strategy and ended with fewer conflicts: this time 749 out of 927 files were found to have the same file format by at least two different tools. Even though the second strategy gives better results than the first one, nothing guarantees that the tools that agree are right and the tool that differs is wrong; the opposite can always be the case. Therefore, a strategy is needed that maximizes the utility of each tool. K., B. [3] developed a strategy to merge results from different tools.
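The two simple merging strategies described above (full intersection, and agreement of at least two of the three tools), together with the label normalization that has to precede them, can be sketched roughly as follows. This is only an illustrative sketch and not the procedure actually used in the experiment: the function names, the alias table (including the assumed `GZ` spelling) and the sample data are all hypothetical.

```python
from collections import Counter

# Hypothetical alias table: maps tool-specific spellings to one canonical label.
# The "GZ" spelling is an assumption made for illustration only.
ALIASES = {"JPG": "JPEG", "GZ": "GZIP"}


def normalize(label):
    """Upper-case a format label and map known aliases to a canonical name."""
    if label is None:
        return None
    key = label.strip().upper()
    return ALIASES.get(key, key)


def merge(results, min_agree):
    """Return {file: format} for files where at least `min_agree` tools agree.

    `results` maps a file name to the list of labels reported by the tools
    (None when a tool could not identify the file).
    """
    merged = {}
    for name, labels in results.items():
        votes = Counter(normalize(label) for label in labels if label is not None)
        if not votes:
            continue  # no tool identified this file
        fmt, count = votes.most_common(1)[0]
        if count >= min_agree:
            merged[name] = fmt
    return merged


# Made-up sample data: labels in the order JHOVE, DROID, Exiftool.
results = {
    "a.jpg": ["JPEG", "jpg", "JPEG"],
    "b.gz": [None, "GZIP", "gzip"],
    "c.pdf": ["PDF", "PDF", "pdf"],
    "d.xml": ["XML", "TEXT", None],
}

intersection = merge(results, min_agree=3)  # all three tools agree
majority = merge(results, min_agree=2)      # at least two tools agree
conflicts = len(results) - len(majority)    # files left without agreement

print(intersection)
print(majority)
print(f"{conflicts} of {len(results)} files remain in conflict")
```

Normalizing before counting is what lets `jpg`, `JPG` and `JPEG` count as one vote; without it, the first sample file would not appear in the intersection at all.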

6.1 A strategy to merge results from different tools.
K., B. [3] used a tool called C3PO (Clever, Crafty, Content Profiling Tool), which serves as a front end for different digital preservation tools; among others, the tools used in this paper are supported by C3PO. Besides the intersection of results, the strategy of K., B. [3] also considers the cases shown in Fig. 5, by which they try to take into consideration the specialized parts of each tool. By applying the rules shown in Fig. 5 in their experiment on 100,000 files, they lowered the number of file format conflicts from 15529 (16.41% of total files) to 3838 (3.98% of total files). In total, they lowered the number of conflicts by a factor of about four.

Rules of merging results, K., B. [3].

Fig. 5. Summary of experiment results: Discovered file formats (left), conflict ratio (right)

6.2 Merging results with the K., B. strategy.
I have also used the K., B. strategy to merge the results from chapter four. As presented in Fig. 5, by considering the intersection of results, 358 out of 927 files are found by each tool to have the same file format. We then appended to this result the results derived by applying each rule proposed by K., B. [3]; as a result, we end up with 452 files whose format is identified either by intersection or by the rules of K., B. [3].
Therefore, in my experiment on 927 files these rules lowered the number of conflicts from 569 (61.38% of total files) to 475 (51.24% of total files). In my case the share of conflicts was lowered by about 10 percentage points, but, as presented in the previous section, at larger scale this strategy gives much better results. The final results of my experiment are shown in Fig. 5.

7. SUMMARY

This paper gives a general overview of digital preservation. It informs the reader what digital preservation is and what it deals with. Furthermore, the paper focuses on file format analysis conducted as a part of digital preservation. It throws a spotlight on file format identification, characterization and verification, which are the three analyses that digital preservation conducts on file formats. First, the paper introduces the reader to the world of file formats by explaining what a file format is and how file formats work.
Next, it presents some of the most powerful open source tools designed for file format analysis. The results of each individual tool are presented to the reader. Here we would like to highlight the fact that, although there are many tools for file format analysis, there is no stable tool that covers a high number of formats and performs acceptably on large-scale data. When working with large-scale data, each of the tools we used seemed to have performance issues; some of them even crashed after a short time and could not give results. Therefore, we had to limit our experiment to a low number of files (927 files). Furthermore, the paper highlights another problem of file format analysis in digital preservation: merging the outputs of different tools to obtain better overall results. Here we have thrown a spotlight on the non-standardization of tools, as a result of which merging and combining the results becomes more difficult than it should be. One good thing to note is that all tools support XML as an output format, which makes merging easier. On the other hand, one big issue encountered is that tools use different notations for file formats; e.g., some tools identify the JPEG file format as JPG and others as JPEG. For software to recognize that both refer to the same format, the format must be noted the same way. In addition, case sensitivity is not standardized, but this issue is solved by using combining tools that are not case sensitive.
Also, the results from different tools differ a lot; as a result, a simple merge of them by finding the intersection was not found to be a good idea. Therefore, this paper presents the work of another paper in which a clever strategy to merge and combine the output of different tools is presented. Finally, this paper uses this strategy to merge the outputs of three tools, JHOVE, DROID, and Exiftool, from tests conducted on the same data set, and at the end the results of this merge are presented to the reader, highlighting the impact of the merging strategy. In summary, by evaluating different tools and strategies for merging their results, this paper presents the current situation in file format analysis in digital preservation and highlights the need for new, more standardized and more stable tools to be developed.

REFERENCES
[1] Abrams, S., et al. (2009). “What? So What”: The Next-Generation JHOVE2 Arch. for Format-Aware Charac. Journal IJDC 4(3), pp. 123-136. Edinburgh, UK.
[2] Ford, K.M. (2011). The Application of File Identification, Validation, and Characterization Tools in Digital Curation. Univ. of Illinois, USA.
[3] Kulmukhametov, A., Becker, C. (2014). Content profiling for preservation: Improving scale, depth and quality. The EDL-RP 8839 (1) pp. 1–11. Thailand.
[4] Lechich, R. (2014). File format identification and validation tools. (http://www.library.yale.edu/iac/DPC/FileIDandValidate.pdf) [last accessed 14.06.’14].
[5] Brown, A. (2005). Pronom 4 information model. Technical report, The National Archives, UK.
[6] Goethals, A. (2010). The unified digital formats registry. ISQ 22(2) pp. 26–29.
[7] Frisch, P., Heino, N., Tramp, S. (2012). Unified digital format registry (udfr). Univ. of California, USA.
[8] Abrams, S. (2004). The role of format in digital preservation. VINE 34(2) pp. 49–55. Emerald…, USA.
[9] Brown, A. (2005). The droid application programming interface. The National Archives, UK.
[10] Raymond, R. (2016). Exiftool documentation. http://www.sno.phy.queensu.ca/~phil/exiftool online, [last accessed on 14.06.2016].
