Copy On Write Based File Systems Performance Analysis and Implementation
Sakis Kasampalis
Abstract

In this work I focus on Copy On Write based file systems. Copy On Write is used in modern file systems to provide (1) metadata and data consistency using transactional semantics, and (2) cheap and instant backups using snapshots and clones.
This thesis is divided into two main parts. The first part focuses on the design
and performance of Copy On Write based file systems. Recent efforts aiming at
creating a Copy On Write based file system are ZFS, Btrfs, ext3cow, Hammer,
and LLFS. My work focuses only on ZFS and Btrfs, since they support the most
advanced features. The main goals of ZFS and Btrfs are to offer a scalable, fault
tolerant, and easy to administrate file system. I evaluate the performance and
scalability of ZFS and Btrfs. The evaluation includes studying their design
and testing their performance and scalability against a set of recommended file
system benchmarks.
Most computers are already based on multi-core and multiprocessor architectures. Because of that, the need for concurrent programming models has increased. Transactions, which ensure that system updates are consistent, can be very helpful for supporting concurrent programming models. Unfortunately, the majority of operating systems and file systems either do not support transactions at all, or simply do not expose them to the users. The second part of this thesis focuses on using Copy On Write as a basis for designing a transaction oriented file system call interface.
Preface
This thesis was prepared at the Department of Informatics, the Technical Uni-
versity of Denmark in partial fulfilment of the requirements for acquiring the
M.Sc. degree in engineering.
The thesis deals with the design and performance of modern Copy On Write
based file systems, and the design of a transactional file system call interface
using Copy On Write semantics. The main focus is on the higher interface of the
Virtual File System to user processes. A minimal transaction oriented system
call interface is proposed, to be used as the Virtual File System of a research
operating system.
Sakis Kasampalis
Acknowledgements
I want to thank my supervisor Sven Karlsson for his help and support regarding many issues related to this thesis. To mention only a few: finding a relevant topic that fits my current skills, suggesting research papers and technical books, giving useful feedback, and providing technical assistance.
I also thank my friend Stavros Passas, currently a PhD student at DTU In-
formatics, for the research papers he provided me concerning operating system
reliability and performance, and transactional memory. Among other things,
Stavros also helped me with writing the thesis proposal and solving several
technical problems.
Special thanks to Andreas Hindborg and Kristoffer Andersen, the BSc students who started working on FenixOS for the needs of their own theses before me. Andreas and Kristoffer helped me become familiar with the development tools of FenixOS very quickly.
Last but not least, I would like to thank Per Friis, one of the department’s
system administrators, for providing me the necessary hardware required to do
the file system performance analysis.
Contents

Abstract
Preface
Acknowledgements
1 Introduction
1.1 Problem statement
1.2 Contributions
1.3 Major results and conclusions
1.4 Outline of the thesis
2 Background
2.1 Operating systems
2.2 File systems
7 Related work
7.1 Performance analysis and scalability tests
7.2 Transaction oriented system call interfaces
8 Conclusions
8.1 Conclusions
8.2 Future work
Chapter 1
Introduction
The motivation behind this thesis is twofold. The first part of the thesis focuses on testing the performance and scalability of two modern Copy On Write (COW) based file systems: the Zettabyte File System (ZFS) and the B-tree file system or “Butter-eff-ess” (Btrfs). COW based file systems can efficiently solve the problems that journaling file systems are facing. Journaling file systems (1) cannot protect both data and metadata, (2) cannot satisfy the large scaling needs of modern storage systems, and (3) are hard to administrate. Recent
file systems use COW to address the problems of fault tolerance, scalability,
and easy administration. The main reason for doing the performance analysis
of ZFS and Btrfs is that we are investigating which file system is the best
option to use in our UNIX-like research operating system called FenixOS. We
are aware that there are more COW based file systems, but we have decided to choose between ZFS and Btrfs for several reasons. ZFS is already stable, has a very innovative architecture, and introduces many interesting features that no other file system has ever had. Btrfs, on the other hand, is under heavy development, targets simplicity and performance, and is supported by the
large GNU/Linux community.
The second part of the thesis deals with the design of a transaction oriented
file system call interface for FenixOS. A transaction is a sequence of operations
described by four properties: Atomicity, Consistency, Isolation, and Durability (ACID). Transactions are essential for using concurrent programming models.
1.1 Problem statement
In a sentence, the problem statement of the thesis is to (1) identify the strengths
and weaknesses of ZFS and Btrfs and propose one of them as the best choice for
the needs of FenixOS, (2) define a transaction oriented file system call interface
for FenixOS, and document the design decisions related to it.
1.2 Contributions
The contributions of this thesis include (1) an evaluation of the performance and
scalability of ZFS and Btrfs under several workloads and different scenarios using
recommended benchmarks, (2) the definition of the file system call interface of
FenixOS and a discussion about the design decisions related to it.
1.3 Major results and conclusions

The performance analysis results of the thesis show that at the moment ZFS seems to be the right choice for FenixOS. The main reasons are that (1) ZFS uses full ACID semantics to offer data and metadata consistency while Btrfs uses a more relaxed approach which violates transactional semantics, and (2) ZFS scales better than Btrfs.
Since I have not implemented the file system call interface, I cannot draw conclusions about its performance. What I can do instead is evaluate the design decisions that I have made against the design of other transaction oriented interfaces.
Cheap and instant snapshots and clones offer online backups and testing en-
vironments to the users. When COW is not used, snapshots and clones are
expensive because non COW based file systems overwrite data in place. That
is why they are usually not supported in file systems which do not use COW.
Network File System (NFS) Access Control Lists (ACLs) are used as the only
file permission model. NFS ACLs are (1) standard, in contrast to “POSIX”
ACLs, (2) more fine-grained than traditional UNIX mode style file attributes
and “POSIX” ACLs. Furthermore, FenixOS does not need to be compatible
with “POSIX” ACLs or traditional UNIX file attributes, so it avoids the backward compatibility, performance, and semantic issues which occur when conversions between formats and synchronisations are required. A
disadvantage of using only NFS ACLs is that they are more complex compared
with traditional UNIX mode style file attributes. Users who are not familiar
with ACLs will need some time to get used to creating and modifying NFS
ACLs. However, users who are already familiar with “POSIX” ACLs will not
have any real problems.
Memory mapped resources are a good replacement for the legacy POSIX read
and write system calls since they (1) are optimised when used with a unified
cache manager, a feature planned to be supported in FenixOS, (2) can be used in
a shared memory model for reducing Interprocess Communication (IPC) costs.
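As an illustration of the memory mapped access pattern (this is ordinary POSIX mmap code, not the FenixOS interface, and the file name is made up), the following C sketch updates a file through a shared mapping instead of issuing read and write system calls:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file name, used only for illustration. */
    int fd = open("example.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; after this, updates go through the mapping and
       no read()/write() system calls are needed. */
    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    if (st.st_size >= 5)
        memcpy(data, "HELLO", 5);   /* modify the mapped bytes in place */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}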
Since our interface is not POSIX compatible (1) it will take some time for
POSIX programmers to get used to it, (2) POSIX code becomes inapplicable for
FenixOS. In our defence, we believe that the advantages of exposing transactions
to programmers using a new interface outweigh the disadvantages. File
system call interfaces which add transaction support without modifying POSIX
end up being slow (user space solutions) or limited: transactions are not exposed
to programmers. Interfaces which modify POSIX to support transactions end
up being complex to use (logging approaches) or complex to develop:
transaction support is added to existing complex non transaction oriented kernel
code. We believe that it is better to design a transaction oriented system call
interface from scratch, instead of modifying an existing non transaction oriented
system call interface. To conclude, supporting POSIX in FenixOS is not a
problem. If we decide that it is necessary to support POSIX, we can create an
emulation layer like many other operating systems have already done. Examples
include the ZFS POSIX Layer (ZPL), the ANSI (American National Standards Institute) POSIX Environment (APE),
and Interix.
1.4 Outline of the thesis

Chapter 3 - “Copy On Write Based File Systems” describes the problems that current file systems have, and how they can be solved using Copy On Write.
Chapter 5 - “A file system call interface for FenixOS” includes a discussion about
the usefulness of transactions in operating systems and file systems, and presents
my contributions to the design of the file system call interface of FenixOS.
Chapter 7 - “Related work” presents the work of other people regarding the per-
formance of ZFS and Btrfs and the design of transactional system call interfaces,
and discusses how they relate to this work.
Chapter 2

Background

This chapter provides definitions for all the important technical terms which will be used extensively throughout this thesis: operating system, kernel, file system,
virtual file system, and more. Apart from that, it (1) describes the most common
ways modern operating systems can be structured, (2) introduces you to the
research operating system FenixOS, and (3) includes references to several well-
known and less known operating systems and file systems.
2.1 Operating systems

It is not easy to provide a single clear definition of what an operating system is. Personally, I like the two roles described in [120, chap. 1]. One role of the
software layer which is called the operating system is to manage the hardware
resources of a machine efficiently. By machine I mean a personal computer, a
mobile phone, a portable media player, a car, a television, etc. Examples of
hardware resources are processors, hard disks, printers, etc. The second role of
the operating system is to hide the complexity of the hardware by providing
simple abstractions to the users.
which is independent of the device on which the data are saved. They do not need to know the details of how Universal Serial Bus (USB) sticks, hard disks, or
magnetic tapes work. The only fundamental operations users ought to learn are
how to create, edit, and save files using their favourite text editor. The same
operations apply for all devices, magnetic or electronic, local or remote.
Figure 2.1 shows the placement of the operating system in relation to the other layers of a personal computer. The operating system is on top of the hard-
ware, managing the resources and providing its abstractions to the applications.
Users interact with the operating system through those abstractions, using their
favourite applications.
Figure 2.1: The placement of the operating system. The operating system is placed
between the hardware and the application layers. It manages the hard-
ware resources efficiently, and provides intuitive abstractions to the user
applications
For most operating systems, a service can run either in kernel/supervisor mode/space or in user mode/space. The services which run in kernel mode have
full control of the hardware, which means that a potential bug can cause severe damage to the whole system: a system crash, critical data corruption, etc. The services which run in user mode have restricted access to the system resources, which makes them less dangerous [121]. For example, if the printer driver crashes while running in user mode, the worst case is that the printer will malfunction. If the same thing happens while the printer driver is running in kernel mode, the results can be much more harmful.
Not all operating systems need to support both kernel and user mode. Personal
computer operating systems, such as GNU/Linux [116, chap. 5], Windows,
and Mac OS X offer both modes. Embedded operating systems on the other
hand, may not have kernel mode. Examples of embedded operating systems are
QNX, eCos, and Symbian OS. Interpreted systems, such as Java based smart
card operating systems use software interpretation to separate the two modes.
When Java is used for instance, the separation happens through the Java Virtual
Machine (JVM) [120, chap. 1]. Examples of interpreted systems are Java Card
and MULTOS.
The heart of an operating system is its kernel. For operating systems which
support two modes, everything inside the kernel is running in supervisor mode,
while everything outside it in user mode. Examples of privileged operations
which run in supervisor mode are process creation and Input/Output (I/O)
operations (writing in a file, asking from the printer to print a document, etc.).
Kernels can be very big or extremely small in size, depending on which services run inside and outside them. Big means millions of lines of code, while small means thousands of lines of code. To date, the two most popular commercial operating system structures follow either a monolithic
or a microkernel approach [120, chap. 1].
In pure monolithic kernels all system services run in kernel mode. The system consists of hundreds or thousands of procedures, which can call each other without any restriction [122, chap. 1]. The main advantage of monolithic systems is their high performance. Their disadvantages are that they are insecure, they have poor fault isolation, and low reliability [52, 118, 121]. A typical monolithic organisation is shown in Figure 2.2. Without getting into details about the role of each service, we can see that all services run in kernel mode. There is no isolation between the services, which means that a bug in one service can cause trouble not only for itself, but for the whole system. A scenario where a service crashes and as a result overwrites important data structures of another service can be disastrous. Examples of monolithic kernels include Linux.
Figure 2.2: A typical monolithic design. On top of the hardware is the operating
system. All services (file system, virtual memory, device drivers, etc.)
run in kernel mode. Applications run in user mode and execute system
calls for switching to kernel mode when they want to perform a privileged
operation
Microkernels follow the philosophy that only a small number of services should
execute in kernel mode. All the rest are moved to user space. Moving services
to user space does not decrease the actual number of service bugs, but it makes
them less powerful. Every user space service is fully isolated from the rest, and
can access kernel data structures only by issuing special kernel/system calls.
The system calls are checked for validity before they are executed. Figure 2.3
demonstrates an example of a microkernel organisation. In the organisation
shown in Figure 2.3, only a minimal number of services runs in kernel space.
All the rest have been moved to user space. When following a microkernel
structure, it is up to the designer to decide which parts will live inside the kernel
and which will be moved outside. Of course, moving services to user space and
providing full isolation between them does not come without costs. The way
the various user mode services can communicate with each other and the kernel
is called Interprocess Communication (IPC). While in monolithic systems IPC
is easy and cheap, in microkernels it gets more complex and is more expensive.
The extra overhead that IPC introduces in microkernel based systems is their
biggest disadvantage. The benefits of microkernels are their enhanced reliability
and fault isolation [52, 118, 121]. Examples of microkernel based systems are
MINIX 3, Coyotos, and Singularity.
Figure 2.3: A microkernel design. A minimal number of services (basic IPC, virtual
memory, and scheduling) runs in kernel mode. All the rest services (file
server, device drivers, etc.) execute in user mode. Applications execute
services using IPC. The services communicate with each other and the
kernel also using IPC
There is a long-standing debate about whether modern operating systems should follow a monolithic or a microkernel approach [119]. Researchers
have also proposed many alternative structures [118, 121]. Personally, I like
Tanenbaum’s idea of creating self-healing systems, and I would not mind sacrificing a small percentage of performance for the sake of reliability and security. However, I am not sure whether the performance loss of modern microkernels is
acceptable. The developers of MINIX 3 argue that the extra overhead of their
system compared to a monolithic system is less than 10%. Yet, MINIX 3 does
not currently support multiple processors. Since modern personal computer
architectures are based on multi-core processors, operating systems must add
support for multiple processors before we can draw any conclusions about their
performance.
2.1.2 FenixOS
When a program is running in user mode and requires some service from the
operating system, it must execute a system call. Through the system call,
the program switches from user mode to kernel mode. Only in kernel mode is the program permitted to access the required operating system service [120, chap. 1].
Back in the 70s and the 80s, when the code of UNIX became publicly available,
many different distributions (also called flavours) started to appear. Some ex-
amples are BSD, SunOS, and SCO UNIX. Unfortunately, the incompatibilities
between the different UNIX flavours made it impossible for a programmer to
write a single program that could run on all of them without problems. To solve
this issue the Institute of Electrical and Electronics Engineers (IEEE) defined
POSIX. POSIX is a minimal system call interface that all UNIX-like systems
should support if they want to minimise the incompatibilities between
them. The systems which support the POSIX interface are called POSIX com-
patible, POSIX compliant, or POSIX conformant (all names apply) [122].
To conclude this section, I should mention that apart from fault tolerance,
reliability, and stability, FenixOS also focuses on scalability. As proof, FenixOS already supports multiple processors. And we are not investigating COW based file systems without reason: scalability is one of the main features that COW based file systems focus on.
Figure 2.4: The structure of FenixOS. Buggy components like device drivers run in
user space. Applications, which also run in user space, never interact directly with those components. They execute system calls to request a service from the kernel. Services like driver wrappers allow the system
to monitor and restrict the activities of the device drivers
or simply file system [115, chap. 12], [120, chap. 4]. Examples of common file
systems are the New Technology File System (NTFS), the fourth extended file
system (ext4), the File Allocation Table (FAT32), and the UNIX File System
(UFS).
An operating system usually supports many file systems. This is required be-
cause users can share data only if the operating systems that they use support
a common file system. For instance, it is not unusual for a GNU/Linux system
to use ext4 as its primary file system, but also support FAT32. In this way it
allows GNU/Linux and Windows users to exchange files.
The need for an operating system to support multiple file systems has led to
the concept of the Virtual node (Vnode) layer [63] or Virtual File System (VFS)
layer. The VFS is a software layer which provides a single, uniform interface
to multiple file systems. The main idea of the VFS is to abstract the common
functionality of all file systems on a high level, and leave the specific implemen-
tation to the actual file systems. This means that when file system developers
wish to add a new file system to an operating system which uses a VFS, they
must implement all the functions (more precisely system calls) required by the
VFS [22, chap. 12], [47, chap. 10], [115, chap. 12], [120, chap. 4].
File systems in FenixOS will run in user space, while the VFS will run in kernel
space. This is demonstrated in Figure 2.4. The VFS will be the mediator
between an actual file system and the kernel. A file system which runs in user
space, must execute a system call through the VFS for switching to kernel mode
and using an operating system service. A typical example is when a program
needs to write some data in a file. All I/O operations, including writing in a
file are privileged. This means that they are only allowed in kernel mode. As
we will see in chapter “A file system call interface for FenixOS”, in FenixOS the
user space file system (for example FAT32) must execute the memoryMap system call to request write access to a file. Note that executing the memoryMap system call does not necessarily mean that write access will be granted. The VFS needs to check whether the requested file exists, whether the user who executes the system call is allowed to write to the requested file (according to the file’s permissions), etc.
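The following C sketch is not FenixOS code; it only illustrates, with hypothetical names (vfs_memory_map, lookup_file, the demo file table), the kind of validation described above before a mapping would be granted:

#include <stddef.h>
#include <string.h>

/* All names and structures here are hypothetical and only illustrate the
   checks described in the text; this is not FenixOS code. */

enum { ACCESS_READ = 1, ACCESS_WRITE = 2 };

struct file {
    const char  *path;
    unsigned int owner_uid;
    int          allowed;     /* simplified stand-in for an NFS ACL        */
    char        *contents;    /* what a successful mapping would expose    */
};

/* A tiny in-memory "volume" with a single file, for illustration. */
static char demo_data[16] = "hello";
static struct file demo_files[] = {
    { "/tmp/example", 1000, ACCESS_READ | ACCESS_WRITE, demo_data },
};

static struct file *lookup_file(const char *path)
{
    for (size_t i = 0; i < sizeof(demo_files) / sizeof(demo_files[0]); i++)
        if (strcmp(demo_files[i].path, path) == 0)
            return &demo_files[i];
    return NULL;                      /* the requested file does not exist */
}

/* Sketch of a memoryMap-style entry point: the request is validated
   before any mapping is established. */
void *vfs_memory_map(const char *path, unsigned int uid, int access)
{
    struct file *f = lookup_file(path);

    if (f == NULL)
        return NULL;                  /* no such file                      */
    if (uid != f->owner_uid || (f->allowed & access) != access)
        return NULL;                  /* e.g. write access is not allowed  */

    return f->contents;               /* a real VFS would map pages here   */
}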
Chapter 3

Copy On Write based file systems
At the beginning of this chapter I explain why it is not uncommon for modern hard disks to fail fairly often. I then continue by describing why a better solution than journaling file systems is required for solving the problems of data consistency and integrity. Finally, I conclude with an introduction to Copy On Write: what Copy On Write actually means, how it relates to file systems, and how it can provide data consistency and integrity efficiently.
File system information can be divided into two parts: data and metadata. Data
are the actual blocks, records, or any other logical grouping the file system uses,
that make up a file. Metadata are pieces of information which describe a file,
but are not part of it. If we take an audio file as an example, the music which can be heard when the file is played using a media player is its data. All the rest of the information, such as the last time the file was accessed, the name of the file, and who can access the file, is the file’s metadata [47, chap. 2].
Even if modern hard disk capacity is rapidly increasing and is considered rela-
tively cheap, there is one problematic factor which still remains constant: the
disk Bit Error Rate (BER). In simple terms, BER is the number of faulty (al-
tered) bits during a data transmission, because of noise, interference, distortion,
etc., divided by the total number of bits transmitted [25]. Modern disk manu-
facturers advertise that there is one uncorrectable error every 10 to 20 terabytes
[117].
Both data and metadata are of great importance. On metadata corruption, the
file system and therefore the user is completely unable to access the corrupted
file. On data corruption, although the user can access the metadata of the file,
the most important part of it becomes permanently unavailable. Thus, it is
important for modern file systems to provide both data and metadata integrity.
Ensuring that data and metadata have not been altered because of BER solves
a number of problems. But there is still one serious problem which needs to
be solved: consistent system updates. The most common file systems (and operating systems in general) do not provide programmers and system administrators with a simple way of ensuring that a set of operations will either succeed
or fail as a single unit, so that the system will end up in a consistent state. A
simple example of an operation which requires this guarantee is adding a new
user in a UNIX-like system. In UNIX-like systems, adding a new user requires
a couple of files to be modified. Usually modifying two files, /etc/passwd and
/etc/shadow is enough. Now, imagine a scenario where the utility which is re-
sponsible for adding a new system user (for example useradd) has successfully
modified /etc/passwd and is ready to begin modifying /etc/shadow. Suddenly,
the operating system crashes because of a bug or shuts down due to a power
failure. The system ends up in an inconsistent state: /etc/passwd contains
the entry for the new user but /etc/shadow is not aware of this modification.
In such cases, it makes more sense to ignore the partial modifications and do
nothing, instead of applying them and ending up with an inconsistent system.
Ignoring the partial modifications is not a bad thing. It keeps the system con-
sistent.
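As an illustration of the guarantee described above, the following C sketch groups the two file updates into a single atomic unit using a hypothetical transactional interface (txn_begin, txn_append, txn_commit, and txn_abort are invented names, not an existing API): either both files are updated, or neither is.

#include <stdio.h>

/* Hypothetical transactional file API, for illustration only;
   no implementation is given here. */
typedef struct txn txn_t;
txn_t *txn_begin(void);
int    txn_append(txn_t *t, const char *path, const char *line);
int    txn_commit(txn_t *t);   /* applies all changes atomically */
void   txn_abort(txn_t *t);    /* discards all partial changes   */

/* Add a user entry to both files, or to neither of them. */
int add_user(const char *passwd_line, const char *shadow_line)
{
    txn_t *t = txn_begin();
    if (t == NULL)
        return -1;

    if (txn_append(t, "/etc/passwd", passwd_line) != 0 ||
        txn_append(t, "/etc/shadow", shadow_line) != 0) {
        txn_abort(t);          /* a crash here leaves the system consistent */
        return -1;
    }
    return txn_commit(t);      /* both entries become visible together */
}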
Now that we have discussed why it is (1) normal for modern hard disks to fail,
(2) important for file systems to enforce both data and metadata integrity and
consistency, let’s identify what kind of problems the most popular file systems have.
Most of the file systems which are currently used in operating systems work
Journaling file systems have a big problem: they can only provide metadata
integrity and consistency. Keeping both data and metadata changes inside the
journal introduces an unacceptable performance overhead, which forces all jour-
naling file systems to log only metadata changes. Since writing the changes
twice - once to the journal and once to the final location - is extremely slow, a better solution
is required [47, chap. 7], [72]. The recommended solution must provide both
metadata and data integrity and consistency in a cheap way.
The most recent solution regarding file system (meta)data consistency and in-
tegrity is called Copy On Write (COW). Before getting into details about what
COW (sometimes called shadowing [102]) is in respect to file systems, I will give
some background information about what COW is in general, and discuss some
applications of it.
If you do not understand some of the terms that are mentioned in the next
paragraph, do not worry. They are given just for completeness. Only one
of them is important for the needs of my thesis: snapshots. A snapshot is a
consistent image of the data at a specific point in time [2, 55]. By data I mean
the contents of a file system, the contents of a database, etc. Snapshots can
be read only, or read/write. Writable snapshots are also called clones [102].
Snapshots are extremely useful and have many applications: data recovery,
online data protection, undo file system operations, testing new installations
and configurations, backup solutions, data mining, and more [72, chap. 1],
[102].
COW is used in many areas of computer science. Only a few to mention are
(1) optimising process creation in operating systems [79, chap. 5], (2) creating
secure, isolated environments, appropriate for running untrusted applications
on top of primary file systems [57], (3) designing systems which are based on
delimited continuations and support advanced features, such as multitasking
and transactions [61], (4) supporting efficient snapshots in database systems
[105], (5) improving snapshots to support both undo and redo operations in
storage systems [131], and (6) combining snapshots with pipelining for more
efficient snapshot creation [127].
Now that I have introduced COW to you, I should underline that optimisation is
not the main reason for using COW in file systems. I believe that COW’s major
contributions to file systems are the support for (1) taking cheap snapshots,
(2) ensuring both data and metadata consistency and integrity with acceptable
performance [72, chap. 1].
When using a COW based file system, offering snapshot support is straightfor-
ward. COW in file systems means a very simple but at the same time very important thing: never overwrite any (meta)data [83, chap. 1]. All (meta)data
modifications are written to a new free place on the disk. The old (meta)data
cannot be freed until all the changes complete successfully. Since data are never
overwritten, it is easy to take a snapshot. You just have to tell the file system “do not free the old space, even after the successful completion of the changes”. Give a name to the old space, and your snapshot is ready. That is
what I mean when talking about cheap snapshots.
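A minimal C sketch of the never-overwrite rule just described; the in-memory block structure and allocation are invented for illustration and do not correspond to ZFS or Btrfs code.

#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Hypothetical in-memory model of a copy-on-write update. */
struct block {
    unsigned char data[BLOCK_SIZE];
    int refcount;                 /* >0 while a snapshot still references it */
};

/* Modify a block without ever overwriting it in place: write the new
   version to freshly allocated space and return it; the old block stays
   intact so a snapshot can keep referring to it. */
struct block *cow_update(struct block *old,
                         size_t offset, const void *buf, size_t len)
{
    if (offset + len > BLOCK_SIZE)
        return NULL;

    struct block *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL;

    memcpy(copy->data, old->data, BLOCK_SIZE);  /* start from the old contents */
    memcpy(copy->data + offset, buf, len);      /* apply the change to the copy */
    copy->refcount = 1;

    /* Taking a snapshot simply means keeping a named reference to `old`
       instead of freeing it once the update completes. */
    return copy;
}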
Transactions have been supported in databases for a long time, but the same is
not true for file systems and operating systems [96, 114, 130]. Some journaling
file systems support transactions but the problem still holds: they can only
protect metadata. Using COW and transactional semantics, file systems can
provide full data consistency.
Digest algorithm 5 (MD5), and then I will repeat the same process after adding
a single character to the file’s contents.
Even with such a simple change, the MD5 checksum is completely different. This
indicates that the file has changed. Note that md5sum works only for data,
not for metadata, thus changing the metadata of the file - for example its access
permissions using chmod - will produce exactly the same checksum. File systems
which support (meta)data integrity do it on a per-block level, or even better
[83, chap. 1]. In this way, they can provide full data integrity.
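To make the per-block idea concrete, here is a small C sketch that verifies a checksum stored alongside each block before returning its data; the checksum function is a trivial placeholder, whereas real file systems such as ZFS and Btrfs use much stronger algorithms (see the next chapter).

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Hypothetical on-disk block: data plus the checksum recorded when it was
   written.  Real file systems typically keep the checksum in the block
   pointer or in separate metadata, not inside the block itself. */
struct disk_block {
    uint8_t  data[BLOCK_SIZE];
    uint64_t stored_checksum;
};

/* Placeholder checksum (a simple Fletcher-like sum), for illustration only. */
static uint64_t checksum(const uint8_t *buf, size_t len)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += a;
    }
    return (b << 32) | (a & 0xffffffffu);
}

/* Return 0 and copy the data out only if the block is intact; otherwise
   report corruption so a redundant copy can be used instead. */
int read_block_verified(const struct disk_block *blk, uint8_t *out)
{
    if (checksum(blk->data, BLOCK_SIZE) != blk->stored_checksum)
        return -1;                       /* corruption detected */
    memcpy(out, blk->data, BLOCK_SIZE);
    return 0;
}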
Not all COW based file systems support (meta)data integrity and consistency.
In the next chapter I focus on ZFS [51] and Btrfs [9]. ZFS and Btrfs
support (meta)data integrity. ZFS also supports (meta)data consistency, while
Btrfs is at the moment more relaxed regarding consistency. Other file sys-
tems based on COW are LLFS [72], ext3cow [93], Write Anywhere File Layout
(WAFL) [55], and Hammer [43].
Chapter 4

Overview, performance analysis, and scalability test of ZFS and Btrfs
In this chapter I begin with an overview of two modern COW based file systems:
ZFS and Btrfs. The chapter continues with a performance analysis between the
two file systems using recommended benchmarks. At the end of this chapter I run a small scalability test to see how well ZFS and Btrfs can scale under large multithreaded workloads.
ZFS and Btrfs are the latest efforts of the free software [116, chap. 3] community
aiming at creating a scalable, fault tolerant, and easy to administrate file system.
In the previous chapter I have focused only on fault tolerance (consistency and
integrity) because I consider it the most important design goal. However, when
the file system designers are designing a new file system, they are trying to
solve as many problems as possible. Two additional problems with current file
systems are (1) their inability to satisfy the large storage requirements of today’s
data centres [27], (2) the lack of a simplified administration model [83, chap. 1].
ZFS and Btrfs are very similar regarding the features that they already support
or plan to support, but after taking a deeper look one can find many differences
in their design, development model, source code license, etc. In the following sections I will take a brief tour of ZFS and Btrfs.
To begin, ZFS and Btrfs are based on COW. They use COW semantics for pro-
viding support for snapshots and clones. ZFS and Btrfs use checksums which
are 256 bits long on both metadata and data to offer integrity, and support mul-
tiple checksumming algorithms. To address the scaling requirements of modern systems, ZFS and Btrfs are designed around a 128 bit architecture. This means that block addresses are 128 bits long, which adds support for extremely large files, in terms of millions of terabytes (TB, 10^12 bytes) [64], or even quadrillions of zettabytes (ZB, 10^21 bytes) [83, chap. 1]. Both file systems have multiple device sup-
port integrated through the usage of software Redundant Array Of Inexpensive
Disks (RAID). RAID is a way of distributing data over multiple hard disks, for
achieving better performance and redundancy. Multiple RAID algorithms exist,
and most of them are supported by ZFS and Btrfs. Finally, both file systems
support data compression for increasing throughput [27, 51].
ZFS is the first file system to introduce many unique features that no other file system has ever had. Some of them are (1) its pooled storage, (2) its self-healing
properties, and (3) the usage of variable block sizes. These unique features are
enough to make ZFS one of the most interesting file systems around. Let me
say a few words about each feature.
it can correct it using the “right” block (the one with the expected checksum)
from a redundant disk [17, 18].
Most file systems use a fixed block size. ZFS supports variable block sizes, using
the concept of metaslabs ranging from 512 bytes to 1 megabyte (MB) [21]. The
main idea of using multiple block sizes is that since files usually have a small
size initially, a small block size will work optimally for them. But as the size
of a file gets increased, its block size can also be increased, to continue working
optimally based on the new file size [51].
There are even more things that make ZFS different. First of all, instead of
following the typical design of most POSIX file systems which use index-nodes
(i-nodes) as their primary data structure [120, chap. 4], ZFS organises data into
many data structures, each one responsible for a very specific job. For instance, (1) variable sized block pointers are used to describe blocks of data, (2)
objects are used to describe an arbitrary piece of storage, for example a plain
file, (3) object sets are used to group objects of a specific type, for example
group file objects into a single file system object set, (4) space maps are used to
keep track of free space, etc. [19, 84, 132].
Another big difference of ZFS compared to other file systems, including Btrfs
which will be described in the next section, is that it is designed to be platform
independent. Because of that, it integrates many operating system parts which
traditional file systems tend to reuse, since they already exist on the operating
system they are being developed for. Examples of such parts are the (1) Input
Output (I/O) scheduler, which decides about the priorities of processes and
which one to pick from the waiting queue. ZFS uses its own I/O scheduler
and a scheduling policy called System Duty Cycle (SDC) [50, 70]. (2) Block
cache, which is the part of the main memory used for improving performance
by holding data which will hopefully be requested in the near future. ZFS uses
its own cache and a modification of the Adaptive Replacement Cache (ARC)
algorithm [132]. (3) Transactional semantics. In contrast to Btrfs as we will
see below, ZFS uses transactions with full ACID semantics for all file system
modifications to guarantee that data are always consistent on disk.
ZFS already supports data deduplication. There is also a beta version of encryp-
tion support. Encryption is important for security reasons, while deduplication
is important for saving disk space. Regarding compression, ZFS keeps metadata
compressed by default. Therefore, enabling compression in ZFS means that data
compression is activated. The minimum requirements for using ZFS are at least
128 MB hard disk space, and at least 1 GB of main memory [83, chap. 2]. Some
people think that the memory requirements of ZFS are too high [71].
To conclude this section, I would like to mention that ZFS is not a research
effort which might be abandoned like many other file systems in the past, but
a successful product: ZFS is already the primary file system of the OpenSolaris
operating system. It can also be used optionally as the primary file system in
FreeBSD, instead of UFS [42].
Btrfs is not as ambitious as ZFS. The idea behind it is to use simple and well-
known data structures, to achieve good scalability and performance - equal or
even better than competing GNU/Linux file systems, including XFS, ext3/4,
and ReiserFS [31, 77]. I think that the main reason behind the beginning of
the Btrfs development was the legal issues which do not allow porting ZFS to
GNU/Linux. The Linux kernel is distributed under the General Public License
(GPL) version 2, while ZFS is distributed under the Common Development
and Distribution License (CDDL) version 1.0. GPL and CDDL are both free
software licenses, but they are not compatible. A module covered by CDDL
cannot be legally linked with a module covered by GPL [48].
I believe that the biggest advantage of Btrfs over ZFS is the bazaar development
model it follows. Like almost all GNU/Linux related projects after the successful
development of the Linux kernel, Btrfs is open for contributions by anyone. This
model has worked very well in the past, and is likely to work better than the
cathedral development model of ZFS, which was created by a single company:
former Sun Microsystems, now part of Oracle [100, chap. 2].
Time for a short technical discussion. Instead of using variable sized blocks,
Btrfs uses fixed block sizes and groups them into extents. An extent is a group
of contiguous data blocks [27, 33]. Using extents has some benefits, like reduced
fragmentation and increased performance, since multiple blocks are always writ-
ten in a continuous manner. Of course, like any logical grouping technique,
extents have their own issues. For example, when data compression is enabled,
even if you only need to read a part of an extent, you have to read and decompress the whole extent. For this reason, the size of compressed extents in Btrfs is
limited.
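As a rough illustration of the concept (not the actual Btrfs on-disk format), an extent can be described by little more than a starting position and a length:

#include <stdint.h>

/* Simplified illustration of an extent: a run of contiguous blocks
   described by where it starts and how long it is.  The real Btrfs extent
   items carry additional fields (generation, compression, reference
   counts, etc.). */
struct extent {
    uint64_t disk_start;   /* first byte (or block) of the run on disk */
    uint64_t length;       /* number of contiguous bytes (or blocks)   */
};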
Everything around Btrfs is built based on a single data structure: the COW
friendly b+tree [9]. When using COW with a tree data structure, a modifica-
tion of a single node propagates up to the root. Traditional b+trees have all
their leaves linked. This fact makes b+trees incompatible with COW, because
a modification of a single leaf means that COW must be applied on the whole
tree. Moreover, applying COW on the whole tree means that all nodes must
be exclusively locked and their pointers must be updated to point at the new
places, which results in very low performance. The COW friendly b+tree in-
troduces some major optimisations on traditional b+trees, to make them COW
compatible. To be more specific, the COW friendly b+tree refers to (1) b+trees
without linked leaves, to avoid the need of applying COW on the whole tree
when a single leaf is modified, (2) the usage of a top-down update procedure
instead of a bottom up, which results in applying COW to a minimum number
of top nodes, and (3) the usage of lazy reference counts per extent for supporting
fast and space-efficient cloning [102].
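The following C sketch illustrates the top-down path copying idea on a plain binary search tree rather than a real b+tree; the node layout and allocation are simplified for clarity and do not correspond to the Btrfs implementation.

#include <stdlib.h>

/* Simplified tree node; a real b+tree node holds many keys and children. */
struct node {
    int key;
    struct node *left, *right;
};

static struct node *clone_node(const struct node *n)
{
    struct node *c = malloc(sizeof(*c));
    if (c != NULL)
        *c = *n;                /* shallow copy: children are shared for now */
    return c;
}

/* Copy-on-write update: walking top-down from the root, only the nodes on
   the path to the modified key are copied; the old root still describes the
   previous, unmodified version of the tree (a snapshot). */
struct node *cow_insert(const struct node *root, int key)
{
    struct node *copy;

    if (root == NULL) {
        copy = malloc(sizeof(*copy));
        if (copy != NULL) {
            copy->key = key;
            copy->left = copy->right = NULL;
        }
        return copy;
    }

    copy = clone_node(root);
    if (copy == NULL)
        return NULL;

    if (key < root->key)
        copy->left = cow_insert(root->left, key);
    else if (key > root->key)
        copy->right = cow_insert(root->right, key);
    /* equal keys: nothing to change, the copied node is returned as is */

    return copy;
}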
Before closing the Btrfs discussion, I should say that Btrfs is still under heavy
development and is not ready to be used as the primary file system for storing
your critical data.
Before getting into technical details, I should say that in contrast to what most
people think, benchmarking is not an easy thing. At least, proper benchmarking
[123]. It takes a very long time, and it requires taking care of many small but extremely important details [124]. Many people use benchmarks, but most of the results they present are not accurate [125]. I admit that I was very impressed when I found out that a whole thesis can be devoted to analysing the performance of file systems [38]. I did my best to keep my tests as fair and accurate as possible.
The scalability test experiments have been conducted on a virtual machine con-
figured with VirtualBox 3.1.8. We are aware that the results of a virtual machine
are not as accurate as the results of experiments executed on real hardware [125].
However, even though a public machine with sufficient hardware resources is available (four 2.93 GHz Intel Xeon X5570 CPUs and 12 GB of RAM), I do not have the privileges to add a benchmark disk for running my experiments. The best I
can do is to configure a virtual machine with two virtual disks on the available
public machine. The virtual machine has 16 CPUs with the Physical Address
Extension (PAE/NX) option enabled. The motherboard has the Input Output
APIC (IO APIC) option enabled, and the Extended Firmware Interface (EFI)
option disabled. From the acceleration options, VT-x/AMD-V and Nested Pag-
ing are enabled. The virtual machine contains 4 GB of RAM to avoid excessive
caching. The operating system is installed on a virtual IDE disk with 16 GB
capacity. The IDE controller type is PIIX4. The benchmark virtual disk is a
The operating system for running the Btrfs performance analysis experiments
is Debian GNU/Linux testing (squeeze). The system is running a 2.6.34 kernel,
downloaded from the Btrfs unstable Git repository [32] on the 29th of May 2010,
and not further customised. For the ZFS performance analysis experiments, the
system is OpenSolaris 2009.06 (x86 LiveCD) [90].
For the scalability test the operating systems are the same as the ones used in
the performance analysis tests, with the only difference that their 64 bit versions
are installed.
The machine is rebooted before the execution of a new benchmark, for example
when switching from bonnie-64 to IOzone. It is not rebooted between each
execution of the same benchmark, for example between the second and third
execution of an IOzone test. This is done to make sure that the system caches
are clean every time a new benchmark begins. Every test is executed 11 times.
The first result is ignored, because it is used just for warming the caches. All
the results are calculated using 95% confidence intervals, based on Student’s
t-distribution. No ageing is used on the operating systems. All non-essential
services and processes are stopped by switching into single user mode before
executing a test. This is done using the command sudo init 1 in Debian, and
svcadm milestone svc:/milestone/single-user:default in OpenSolaris. I
am using a script-based approach to minimise typing mistakes [123].
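For reference, here is a small C sketch of how a 95% confidence interval can be computed from the ten kept runs; the critical value 2.262 is the two-sided Student's t quantile for 9 degrees of freedom, and the sample values are made up.

#include <math.h>
#include <stdio.h>

/* 95% confidence interval for the mean of n samples using Student's
   t-distribution.  With n = 10 kept runs there are 9 degrees of freedom,
   giving a two-sided critical value of about 2.262. */
static void confidence_interval(const double *x, int n, double t_crit)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < n; i++)
        mean += x[i];
    mean /= n;

    for (int i = 0; i < n; i++)
        var += (x[i] - mean) * (x[i] - mean);
    var /= (n - 1);                         /* sample variance */

    double half_width = t_crit * sqrt(var / n);
    printf("mean = %.2f, 95%% CI = [%.2f, %.2f]\n",
           mean, mean - half_width, mean + half_width);
}

int main(void)
{
    /* Made-up throughput samples (MB/second) from ten benchmark runs. */
    double runs[10] = { 51.2, 52.0, 51.7, 52.4, 51.9,
                        52.1, 51.5, 52.3, 51.8, 52.0 };
    confidence_interval(runs, 10, 2.262);
    return 0;
}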
ZFS and Btrfs are always mounted with the default options. This means that in
ZFS metadata are compressed and data are uncompressed. In Btrfs, both data
and metadata are uncompressed. The scalability test is done only on a single
disk. The performance analysis experiments are conducted both on a single
disk and on a software RAID 1 configuration using two disks (mirroring). I will
demonstrate two simple examples of creating a RAID 1 ZFS and Btrfs configura-
tion for completeness. If c7d0 is the disk which has OpenSolaris installed, c7d1
is the primary disk, and c8d1 the secondary, the command to create a RAID
1 ZFS pool called mir is zpool create mir mirror c7d1 c8d1. Similarly, if
/dev/sda is the disk which has GNU/Linux installed, /dev/sdb is the primary
disk, and /dev/sdc the secondary, the command to create a RAID 1 Btrfs file
system is sudo mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc.
I will begin with the performance analysis results of bonnie-64. bonnie-64 is the
64 bit version of the classic Bonnie micro-benchmark [15], written by the original
Bonnie author. Bonnie has become popular because of its simplicity, and the
easy to understand results which it produces [24]. For these reasons, no matter
how old it is and despite its disadvantages [125] Bonnie is still extensively used in
serious developments [37, 125]. I am using bonnie-64 version 2.0.6, downloaded
from the Google code repository [13]. Tables 4.1 and 4.2 show the results of
bonnie-64 for ZFS and Btrfs, on the single disk configuration.
I have chosen to set the file size (“Size” column) to 12 GB to make sure that the
system caches are bypassed. The author of Bonnie suggests setting this number
equal to at least four times the size of the main memory [15]. I am setting
it to six times the size of the computer’s RAM to ensure that the caches will
not be used. From all the results displayed in Tables 4.1 and 4.2, the most
important are the block sequential input and output. Per character operations
are not of great importance, since no serious application which makes heavy use of I/O will read or write a single character at a time [76]. Rewrite is also not very
important since we know that COW based file systems never write data in
place. However, it can still provide an indication about how effective the cache
of the file system is [40] (if the cache is not bypassed) and how well a COW file
system performs when doing rewrite operations. A last thing to note is that
the CPU usage and the random seeks are not precise for multiprocessor (and
therefore multi-core) systems [15]. For this reason I will use the CPU usage just
as an indication about how busy the CPUs are. The random seek results will
be used as a general indicator of the efficiency of the “random seek and modify”
operation relative to the speed of a single CPU. An example of the command
that I execute for getting the performance analysis results using bonnie-64 is
Bonnie -s 12288 -d /mnt/btrfs -m "Single Disk" >> bonnie-btrfs-0.
Table 4.2: Sequential input and random seeks on the single disk
What we can generally observe in Tables 4.1 and 4.2 is that Btrfs outperforms
ZFS on most tests. More specifically, Btrfs has an output rate of 52 MB/second
when writing the 12 GB file per block, while ZFS reports 28 MB/second. This
most probably happens because the b+tree extents are optimised for sequential
operations and perform better than the ZFS metaslabs [19]. Another reason
behind the higher sequential output data rates of Btrfs is the usage of simple
and fast block allocation policies. As we will see, while such policies help the
Btrfs file system to achieve high sequential output data rates on simple systems
(for example servers with a single hard disk), they have a negative impact when
using more complex configurations (redundant schemes like RAID 1).
When reading the 12 GB file per block, both file systems have the same input
rate, 52 MB/second. This verifies that extents are good for both sequential
reads/writes, since the file system can read/write in a contiguous manner. A
reason why ZFS has a much better sequential input performance compared to its
sequential output might be that it needs to use transactional semantics during
file system writes [35, 50]. Btrfs transactions violate ACID and are most likely
faster. Also, I do not think that they are used to protect all write operations
[41]. ZFS transactions add extra overhead since they protect all file system
modifications and respect ACID semantics. Transactions are not needed when
reading data, since nothing needs to be modified. That makes ZFS sequential
reads faster than sequential writes.
In the “Random Seeks” test bonnie-64 creates 4 child processes which perform
4000 seeks to random locations. On 10% of the random seeks, bonnie-64 actually
modifies the read data and rewrites them. What we can see in this test is that
Btrfs has an effective seek rate of 89-95 seeks/second, while ZFS has a lower rate:
39-40 seeks/second. This most likely happens because Btrfs is tightly connected
with the Linux kernel, and performs better than ZFS regarding random seeks
and modifications when the cache is not utilised. ZFS depends heavily on its
ARC cache, and cannot achieve a high random seek rate when the cache is
bypassed [101].
A negative aspect of Btrfs is that it generally keeps the CPUs busier than ZFS. Using a single data structure (the b+tree) for everything results in
a simpler file system architecture and makes coding and refactoring easier than
using multiple data structures - objects, object sets, metaslabs, space maps,
AVL trees, etc. [19, 84, 132]. But I believe that it also affects certain operations,
which end up being CPU bound.
Let’s move to the results of bonnie-64 when RAID 1 (mirroring) is used. Theoretically, when mirroring is used the output rates should be slightly lower compared to those of a single disk. This is because the file sys-
tem must write the same data twice (on both disks). The input rates might be
increased, if certain optimisations are supported by the RAID implementation.
For example, if the file system can read from both disks simultaneously, the
input rates will be increased. Tables 4.3 and 4.4 display the mirroring results.
The first thing that we can observe is that the output rates are more or less
either the same or slightly decreased, both for ZFS and Btrfs. The same is not
true for the input rates. Here we can see that ZFS, with a per block input rate
of 82 MB/second, outperforms Btrfs, which has a per block input rate of 52-53
MB/second. This tells us that the software RAID 1 implementation of ZFS can
achieve better input rates than the Btrfs implementation. ZFS is more scalable
regarding sequential input, which most likely means that the dynamic striping
(any block can be written to any location on any disk) [51] of ZFS works better
than the object level striping of Btrfs.
Things are different in the “Random Seeks” test. In this case, the Btrfs implementation seems to make use of both disks, and achieves an effective seek rate of 161-182 seeks/second. ZFS, on the other hand, achieves a slightly higher effective seek rate than it does when a single disk is used: 66-68 seeks/second. The same argument still holds here: Btrfs, being tightly inte-
grated with the Linux kernel and designed towards simplicity, achieves a higher
random seek rate than ZFS when the cache is bypassed.
The next performance analysis results that we will investigate are produced by
the Bonnie++ micro-benchmark [14]. I have decided to use Bonnie++ because I
managed to find some results [88] that I can use as a reference. Apart from that,
Bonnie++ is interesting due to its metadata related operations [76], which are
not supported by bonnie-64. The version of Bonnie++ I am using is 1.96 [39].
An example of a Bonnie++ execution using the same configuration as in [88] is
bonnie++ -s 16384 -d /mnt/btrfs -m "Single Disk" -n 8:4194304:1024
-u 0 -f -q >> bonnie-btrfs-0. To get details about what each command ar-
gument is, please refer to the Bonnie++ user manual.
A few notes before analysing the Bonnie++ results: (1) I have skipped the uninteresting per character results, to speed up the benchmarking process. (2)
Bonnie++ by default reports results in KB/second. I have converted KB into
MB [94] and rounded them up using a single digit precision, to make the results
more readable. (3) Bonnie++ normally reports more numbers than the ones
shown in my results, for example disk latencies. Whatever I skipped, I skipped either because it was too time consuming to include, or because the benchmark could not report meaningful numbers despite the different configurations I tried [76]. Anyway, this is not very important. The important thing in this experiment is to see the relationship between my bonnie-64 results, my Bonnie++ results, and the Bonnie++ results of [88].
Tables 4.5 and 4.6 show the Bonnie++ single disk results. First of all, the
differences between the numbers presented in Tables 4.1, 4.2, 4.5, 4.6, and [88]
are absolutely normal. This happens because there are differences between the underlying hardware, the versions of software used, the operating systems, etc. [125]. What we should focus on is not the actual numbers, but whether the numbers agree or not.
Table 4.6: Random seeks and metadata creation on the single disk
To begin, all results indicate that Btrfs achieves higher sequential output rates
than ZFS. My tests show that the sequential input rates on single disks are nearly identical for both file systems, slightly higher for Btrfs. I believe that the low performance of ZFS presented in [88] occurs because ZFS is configured to run in the user space of a GNU/Linux system, using a service like Filesystem in Userspace (FUSE) [46]. In my tests ZFS runs in the kernel space of its native
operating system. That is why it is performing much better. Having said that,
it seems that extents can achieve higher sequential output rates than metaslabs
on single disks. The sequential input rates on single disks are slightly higher for
extents over metaslabs.
The next thing that we should observe is that in contrast to the bonnie-64
results, in Bonnie++ ZFS achieves a higher effective seek rate (137-148 seek-
s/second) than Btrfs (68-69 seeks/second). One thing that we should be aware
of about the random seek test before making any conclusions, is that it is not
the same for bonnie-64 [16] and Bonnie++ [40]. Two important differences are
the number of the created child processes (four in bonnie-64 against three in
Bonnie++) and the number of seeks which are performed (4000 in bonnie-64
against 8000 in Bonnie++). Without being able to make a safe conclusion be-
cause of such differences, the random seek result of Bonnie++ indicates that
ZFS allocates related blocks close on-disk to reduce the disk arm motion [120,
chap. 4].
The final observation about the single disk results is that ZFS outperforms Btrfs
in the metadata creation tests: “Sequential Create” and “Random Create”. I
think that it is expected for ZFS to perform much better than Btrfs in all
metadata tests, since in ZFS metadata are compressed by default, while in
Btrfs they are not.
It is time for the RAID 1 results of Bonnie++. They are shown in Tables 4.7
and 4.8.
The first thing that we observe is that the output rates are consistent with the
output rates shown in Table 4.3. Btrfs achieves higher output rates (both block
and rewrite) than ZFS. The per block output rate of Btrfs, 48.4-48.6 MB/second,
is much higher than the per block output of ZFS, 26-26.3 MB/second. The
rewrite output rate is slightly higher for Btrfs: 21.5-21.8 MB/second, against 19-
19.3 MB/second for ZFS. Extents achieve higher sequential output and rewrite
rates on both single disks and mirroring. Those results verify that extents can
scale regarding sequential output.
The input rates and the effective seek rates are also consistent with the rates
of Tables 4.3 and 4.4. ZFS achieves a per block input rate of 76-77.7 MB/sec-
ond, while Btrfs reports a per block input rate of 54.5-54.9 MB/second. The
input rates verify that ZFS has a better data striping (called dynamic striping)
implementation than Btrfs. Regarding random seeks (202-247 for ZFS against
109-118 for Btrfs), it is expected that the total number of read seeks will be
increased when mirroring is used [40]. ZFS is still better in this test, which I suspect happens because the block allocation policies of ZFS scale better than those of Btrfs.
The final numbers that we should observe are related to metadata creation. There are no great changes in metadata creation (“Sequential Create” and “Random Create”) when mirroring is used. ZFS still performs much better than Btrfs, and this again happens because metadata in ZFS are compressed by default.
You should now have an idea about the usefulness of micro-benchmarks. They
raise important questions like: (1) “Why does Btrfs achieve higher sequential output
rates than ZFS?”. “Is it because extents work better than variable block sizes?”.
“Or is it because ZFS is designed with innovation and portability in mind (complex
data structures, separate cache, separate I/O scheduler, etc.), while Btrfs is designed
with simplicity in mind (simple data structures, system caches, kernel’s scheduler, etc.)?”
“Is this expected, or is there room for improvement in this part of ZFS?”. (2) “What
is wrong with the RAID 1 input rates of Btrfs?”. “Should the implementation
be improved to support reading from both disks simultaneously? Or are reads
inefficient because data are not striped efficiently when they are written to the
disks?”. (3) “Why is Btrfs keeping the CPUs so busy when performing random
seeks?”. “Is it because the b+tree algorithms are very CPU bound?”. “Is this
fine, or must the Btrfs developers optimise the algorithms to make the file system
less CPU bound?”. (4) “Should Btrfs reconsider adding metadata compression
to improve its metadata related operations, or is this not very important?”.
The Filebench fileserver summarised results for Btrfs and ZFS on the single disk
configuration are displayed in Table 4.9. The fileserver test uses 50 threads by
default during its execution. I have divided the Filebench results into three
categories. “Throughput breakdown” shows the number of operations the file
system can execute. “Bandwidth breakdown” shows how many MB/second the
file system is able to use for any operation: read, write, random append, etc.
“Efficiency breakdown” shows the code path length in instructions/operation
[108]. For all the reported results except “Efficiency breakdown”, the higher the
numbers are, the better the result is.
What we can see by observing Table 4.9 is that Btrfs has a higher throughput
breakdown and bandwidth breakdown than ZFS. Btrfs is able to execute 32091-
35755 operations in total and 535-596 operations/second, while ZFS can execute
15760-16387 operations in total and 260-270 operations/second. Btrfs achieves a bandwidth rate of 12-
14 MB/second, while ZFS achieves a bandwidth rate of 6 MB/second. ZFS has
a better efficiency breakdown, which is 410-586 instructions/operation, whereas
Btrfs has a longer code path length of 722-769 instructions/operation. This
generally means that ZFS works more efficiently with multiple threads than
Btrfs, even though it cannot outperform it (in this test).
The RAID 1 fileserver summarised results for Btrfs and ZFS are shown in Table
4.10. After observing the results, there is not much to comment on. The
only change is that the numbers are higher, but the outcomes are still the same:
Btrfs has a higher throughput breakdown and bandwidth breakdown than ZFS,
while ZFS has a higher efficiency breakdown than Btrfs.
Apart from the summarised results that I have discussed, Filebench provides
detailed results about the performance of the file system per operation. For ex-
ample, Filebench reports the throughput, bandwidth, and efficiency breakdown
of a single read file or write file operation. Such results are very important for file
system developers, since they can help them recognise which specific operations
need to be improved.
Since I had the chance to cheat and take a look into the detailed results, I
can tell you that Btrfs beats ZFS in the read, random append, and create
tests. File creation in particular seems to be the bottleneck of ZFS. To give you
some numbers, the read latency of Btrfs on the single disk is around 155.9
milliseconds (msec), while in ZFS it is around 586.9 msec. I suspect that this
happens because Btrfs achieves better short-term prefetching than ZFS [101].
The random append latency on the single disk is around 478.8 msec in Btrfs,
while in ZFS it is around 613.7 msec. This makes sense, since all file system
modifications in ZFS are protected by ACID and are therefore slower than in
Btrfs. File creation on the single disk takes around 0.1 msec in Btrfs, whereas
in ZFS it takes around 552.4 msec! File creation is slow in ZFS because when
a file is initially created its metadata must also be compressed, since metadata
are compressed by default in ZFS.
By using the macro-benchmark emulator and observing the detailed results
of Filebench, we can see that Btrfs outperforms ZFS on the fileserver test and is
the right choice for NFS servers with small disk capacities, and for applications
that place special emphasis on reading, creating, and randomly appending
files.
In this experiment I will try to run some basic tests with IOzone, using several
different file sizes and a fixed record size of 64 KB. The 64 KB record size emu-
lates applications that read and write files using 64 KB buffers. The goal of the
experiment is to see the out of the box performance of Btrfs and ZFS against a
predefined workload. An example of using IOzone to conduct this experiment
is iozone -i 0 -i 1 -i 2 -g 4G -q 64K -a -S 6144K -f /prim/i >> io.
For details about what each argument means, please consult the user manual of
IOzone [89]. I am using IOzone version 3.347 (iozone3_347.tar) [56].
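To make the workload a little more concrete, the following small C++ sketch illustrates the kind of application behaviour that a fixed 64 KB record size emulates: writing and then reading a file in 64 KB chunks. The file name and the number of records are arbitrary choices for the illustration, and the sketch has nothing to do with how IOzone itself is implemented.

#include <cstdio>
#include <vector>

// Write and then read a file using 64 KB buffers, i.e. the access pattern
// that the fixed 64 KB IOzone record size is meant to emulate.
int main() {
    const std::size_t kRecordSize = 64 * 1024;    // 64 KB per record
    std::vector<char> buffer(kRecordSize, 'x');

    if (FILE* out = std::fopen("testfile.bin", "wb")) {
        for (int i = 0; i < 1024; ++i) {          // 1024 records = a 64 MB file
            std::fwrite(buffer.data(), 1, buffer.size(), out);
        }
        std::fclose(out);
    }

    if (FILE* in = std::fopen("testfile.bin", "rb")) {
        while (std::fread(buffer.data(), 1, buffer.size(), in) == buffer.size()) {
            // process one 64 KB record at a time
        }
        std::fclose(in);
    }
    return 0;
}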
Tables 4.11 and 4.12 show the IOzone results on the single disk configuration.
Only the “best” results are displayed. By best I mean the results where most
of the tested file system operations achieve the maximum performance. IOzone
reports results in KB/second. Using the same approach I used with the results
of Bonnie++, I have converted KB into GB to make the output more readable.
The first thing that looks strange after observing the IOzone results is that
both file systems report very high data rates: in terms of GB/second! I have
discussed this issue with the main IOzone developer, Don Capps. His response
was:
“Once the file size is larger than the amount of RAM in the sys-
tem, then you know for sure that you are seeing purely physical I/O
speeds. Until the file size gets that large, the operating system and
hardware are going to be trying to cache the file data, and accelerat-
ing I/O requests. From your results, it appears that your operating
system has about 1 Gig of RAM that it is using to cache file data.
If you look closely at the re-read results, you can also see the impact
of the CPU L2.”
In my case, the file size (4 GB) is twice as large as the amount of system
RAM (2 GB). It seems that the high rates occur because most of the data are
cached.
The next thing to observe is the improvement in the performance of ZFS. ZFS
outperforms Btrfs in all tests but “Write”. All the results of IOzone except
(sequential) write contradict the results of bonnie-64 and Bonnie++. This is
not strange. In the bonnie-64 and Bonnie++
tests I bypassed the caches by setting a very big file size. In this test I make sure
that the file system cache will be used by setting a small file size (-g 4G). I also
make sure that the CPU level 2 cache will be used by defining it (-S 6144K).
When the caches are used ZFS outperforms Btrfs on the majority of tests: COW
works better in ZFS (rewrite). ZFS has a better cache utilisation than Btrfs
(reread). And ZFS reports higher read, random read, and random write data
rates. Once again, the Btrfs extents achieve higher sequential write data
rates than the ZFS metaslabs. I believe that there might be a relation between
the write test results and the slow file creation of ZFS, since before starting to
write to a new file you first need to create it.
Let’s move on to the RAID 1 results of IOzone. They are shown in Tables 4.13
and 4.14. There is not much to comment on regarding the mirroring
results, since there are no significant changes. ZFS still outperforms Btrfs in all
tests but write.
All results confirm that ZFS is the appropriate file system to use, both on single
disks and on RAID 1 configurations, when the caches are utilised. In this test
I am focusing on applications which use 64 KB buffers. But I believe that
ZFS outperforms Btrfs when the caches are utilised no matter the size of the
application buffers. This happens because the advanced ARC cache of ZFS
performs much better than the existing system caches of the Linux kernel that
Btrfs makes use of.
In this experiment I am again using Filebench to test how well ZFS and
Btrfs can scale. This time, I am using a different Filebench test. It is called varmail
[110], and is nothing more than a multithreaded version of PostMark [58, 124].
Varmail uses 16 threads by default. The major operations of the varmail test
are opening, closing, and reading files. The remaining operations are file system
synchronisation (fsync), random file appends, and file deletes. I have decided
not to include any examples of running varmail, since you can consult Example
4.1. Only the name of the test changes. The results of varmail are displayed in
Table 4.15.
It is not hard to see that ZFS beats Btrfs in all tests without problems. ZFS
has better throughput breakdown, better bandwidth breakdown, and achieves
a much shorter code path length. Btrfs has serious trouble with the varmail test,
which reveals its scaling problems. I was once again curious to see the specific
operations where Btrfs cannot perform well. I found out that the biggest pain
of Btrfs is by far fsync. Let’s see the results of two fsync tests. In the first test
ZFS reports 68.7 msec, while Btrfs reports 7949.2 msec! In the second test, ZFS
reports 1403.8 msec, whereas Btrfs reports 5945.3 msec. In both cases, fsync
in Btrfs is significantly slower than it is in ZFS. It seems that the issues of Btrfs
with fsync are well-known to the free software community [103]. Btrfs is not
(at least yet) optimised to work synchronously.
Chapter 5
A file system call interface for FenixOS
This chapter begins with a discussion regarding legacy file system interfaces. It
then continues with presenting my contributions to the design of a file system
call interface for FenixOS. The final part of the chapter includes a graphical
representation and textual description of the major file system calls, as well as
a hypothetical example of using the FenixOS file system interface.
In chapter “Copy On Write based file systems” I have discussed how COW
based file systems provide (meta)data consistency using transactional semantics.
I have explained what transactions are, and what ACID means with regard to
file systems. I have also briefly mentioned the big problem with current file
systems and operating systems: the lack of support for transactional semantics.
Time to extend this discussion a little more.
Most system call interfaces of UNIX-like operating systems are POSIX compli-
ant. Other system call interfaces follow their own approach, which is actually
not very different from POSIX [120, chap. 1]. Either POSIX compliant or not,
the system call interfaces of all commercial operating systems have the same
problem: they do not support transactions [96]. Transaction support at the op-
erating system level is critical. Transactions provide a simple API to program-
mers for (1) eliminating security vulnerabilities, (2) rolling back unsuccessful
software installations and failed upgrades, (3) ensuring that file system updates
are consistent, (4) focusing on the logic of a program instead of struggling to
find out how to write deadlock avoidance routines, (5) writing programs which
can perform better than lock-based approaches, etc. [11, 74, 96, 114, 126, 130].
We use system transactions to protect the system state. The reasons are that
system transactions (1) do not require any special hardware or software (for
example special compiler support), (2) support system calls, (3) allow everything
inside a system transaction (including system calls) to be isolated and rolled
back without violating transactional semantics, and (4) can be optionally integrated
with Hardware Transactional Memory (HTM) [11, 74] or Software Transactional
Memory (STM) [126] to protect the state of an application [96].
Nowadays, most platforms are based on multi-core and multiple processors. For
taking advantage of such architectures, multithreading and parallel program-
ming are essential. Transactions can be extremely helpful for concurrent pro-
gramming. Modern operating systems do not have an excuse for not supporting
transactions, since databases have been supported them without problems for a
long time [114, 130].
Of course, not only operating systems should support transactions, but they
should also expose them to programmers. In ZFS, for instance, even though
ACID semantics are offered for all file system modifications, transactions are
transparently processed at the Data Management Unit (DMU) layer. They are only used
internally and are not visible to programmers. At the file system level, only a
legacy POSIX interface is available through the ZFS POSIX Layer (ZPL) [132].
Hopefully at this point I have clearly expressed my thoughts about the usefulness
of transactions. Transactions are not useful only for file systems. They can also
be useful for other operating system parts. Such parts are processes, memory,
networking, signals, etc. [96]. But since my work focuses on file systems, my
discussions from now on will be mostly related with the file system call interface.
The final part of my thesis includes defining a modern file system call interface
for FenixOS. My focus is on the higher interface of the file system to user
processes.
The fact that FenixOS is a research operating system is very convenient for
introducing new aspects. For example people who want to support transac-
tions on systems like GNU/Linux must do it with a minimal impact on the
existing file system and application code [114]. The result is either a slow user
space implementation, or a complex user API. Others, for instance ZFS, sup-
port transactions only at the file system level. And even though ZFS supports
transactions, it does not expose them. This limits the usefulness of transac-
tions. In FenixOS we do not have such concerns. We do not need to be POSIX
compliant because it is not a big deal for us if POSIX code cannot be used on
FenixOS. We would rather provide programmers with a better system
call interface: one where transactions are exposed, and where using
them is intuitive and straightforward. And we believe that it is much better for
us to design a transaction oriented system call interface from scratch, instead
of trying to add transaction support to a non transaction oriented interface like
POSIX.
At this point I would like to clarify that I have nothing against POSIX. POSIX
was defined to solve a very specific problem. And indeed, I believe that it
solves it very well. But with the transition to multiprocessor and multi-core
architectures, the usage of concurrent programming models raises the need for
designing new system call interfaces, which are more appropriate for transaction
oriented code. If necessary, we can always create a POSIX compatibility layer
to support POSIX code in FenixOS. Many operating systems have already done
this successfully. Only a few examples are APE, ZPL, and Interix [82].
After studying several system call interface designs, I have concluded that the
most innovative is the one used in ZFS. The “Related Work” chapter includes a
discussion about each interface, but let me say a few words about them. Btrfs
violates ACID semantics [41] which is not what we want. Valor is only file sys-
tem oriented, has a complex API, and does not use COW [114]. We focus on
COW and we want to provide an intuitive and simple API to users. Moreover,
we would like to use transactions not only in file systems, but also in the rest
of the operating system. Oracle’s Berkeley DataBase (BDB) is not considered a
file system call interface, since it can be used only on flat files like databases [91].
Even though there are user space solutions based on BDB which provide trans-
actions, they have proved to be very slow. TxOS is intuitive to use and has nice
features, but it is tightly connected with the implementation of the Linux ker-
nel. TxOS relies heavily on the data structures and mechanisms of GNU/Linux
which means that it has a complex GNU/Linux related implementation [96].
ZFS seems to be a good choice. Obviously I am not referring to the legacy ZPL
layer. This is nothing more than yet another POSIX interface. The interesting
part of ZFS in this case is the DMU layer. When I discussed ZFS, I mentioned
how it uses the concepts of objects and object sets to organise the on-
disk data. Objects, object sets, and transactional semantics are extensively used
in the DMU layer [84, chap. 3], [132]. We believe that following a design similar
to the design of DMU is the right approach for creating a transaction oriented
file system call interface for FenixOS. Moreover, using object based approaches
can contribute to the creation of semantic and searchable VFS infrastructures
[104, 113].
DMU cannot be used as is. The first issue, which is implementation related, is
the programming language: while ZFS is written in C, FenixOS is written in C++.
I am not planning to begin another language flamewar here. A serious technical
comparison of several programming languages, including C and C++, can be
found in [98]. If you would like my opinion, I believe that C is faster, simpler, and
requires less memory than C++. On the other hand, using C++ in operating
systems helps you to take advantage of all the standard object oriented features -
encapsulation, inheritance and polymorphism - as well as some exclusive features
of the language: namespaces, operator overloading, templates, etc. Another
problem is that DMU cannot be exposed as is. Since DMU is not exposed to
programmers in ZFS, issues like file permissions and protection are taken care
of by the ZPL layer. In our case, we have to solve such issues at the VFS level.
Let me discuss the design decisions regarding the file system call interface
of FenixOS. Our main goal is to design a file system call interface which supports
transactions. Most transaction oriented interfaces are inspired by Oracle’s BDB
[91]. A programmer uses simple begin and commit statements to initiate and
complete a transaction. During the transaction, the programmer can use the
abort statement to terminate the transaction and ignore all its changes. Some
interfaces also provide a rollback functionality, which does the same thing as
undo: it reverts all the changes of the most recent commit of the transaction
which is rolled back. Most interfaces also provide the option of relaxing ACID
(usually isolation and durability) guarantees for the sake of optimisation [91,
chap. 1], [114]. We think that violating ACID is a poor design decision and we
do not wish to adopt it.
Our interface will support the highest database isolation degree, which is number
3 [91, chap. 4]. Moreover, the interface will use a strong atomicity model.
Defining the model of atomicity which will be used is important, since a direct
conversion of lock based code to transaction based code is generally unsafe [10].
We will use a strong atomicity model because it ensures that system updates
are consistent even if both transactional and non-transactional activities try
to access the same system resources. This is done by serialising transactional
and non-transactional code and deciding, by assigning priorities, which one
should be executed first. There is no need to use HTM/STM, since we
want to protect only the system state and not the data structures of a specific
application [96]. However, integrating STM with the interface of FenixOS should
be straightforward, in case the state of an application needs to be protected.
Apart from offering transactions, we have decided to increase the flexibility of the
other file system parts. An example of such a part is file protection. Traditional
mode style file attributes [69] used by default in most UNIX-like operating
systems are simple, but they are not rich enough to cover the needs of complex
system configurations. As the complexity of a system increases, the inflexibility
and the poor semantics of traditional UNIX attributes cause big headaches to
system administrators. To solve this problem, some systems provide Access
Control List (ACL) support. To put it simply, ACLs are entries which describe
which users or processes have the permission to access an operating system
object, and what operations they are allowed to perform on it. ACLs have
better semantics and are richer than traditional UNIX file permissions, allowing
systems to provide a better security model than the security model of traditional
mode based systems.
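To make the idea of an ACL entry more concrete, here is a small, self-contained C++ sketch of an ACL as an ordered list of access control entries. It only mirrors the general shape of NFSv4-style entries (an allow/deny type, an access mask, and a principal); the type names, mask values, and the deliberately simplified first-match check are my own illustrative assumptions, not part of NFSv4.1 or FenixOS.

#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch only: the names and values are assumptions, not the
// NFSv4.1 wire format.
enum class AceType { Allow, Deny };

// A tiny subset of access mask bits, modelled loosely after NFSv4-style permissions.
constexpr std::uint32_t kReadData  = 0x1;
constexpr std::uint32_t kWriteData = 0x2;
constexpr std::uint32_t kExecute   = 0x4;

struct AccessControlEntry {
    AceType type;        // allow or deny the operations named in the mask
    std::uint32_t mask;  // which operations this entry talks about
    std::string who;     // principal, e.g. "alice@example.org" or "GROUP@"
};

// An ACL is an ordered list of entries.
using Acl = std::vector<AccessControlEntry>;

// Deliberately simplified check: the first entry that matches the principal and
// mentions the requested operations decides (real NFSv4.1 evaluation is more involved).
bool isAllowed(const Acl& acl, const std::string& who, std::uint32_t wanted) {
    for (const AccessControlEntry& ace : acl) {
        if (ace.who != who || (ace.mask & wanted) == 0) {
            continue;  // entry does not apply to this principal or operation
        }
        return ace.type == AceType::Allow;
    }
    return false;  // no matching entry: deny by default
}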
We have decided to support the standard ACLs of the Network File System
Version 4 Minor Version 1 (NFSv4.1) Protocol [106]. In contrast to “POSIX”
ACLs which are not really POSIX since they were never accepted by the POSIX
standard, NFSv4.x ACLs are a standard part of the NFS protocol. Apart from
being standard, NFSv4.x ACLs are richer than “POSIX” ACLs. For those
reasons, operating system developers have started to implement support for
NFSv4 ACLs. ZFS, OpenSolaris, and FreeBSD already support NFSv4 ACLs
[73], [83, chap. 8], [87].
Operating systems which add support for NFSv4.x but also support “POSIX”
ACLs and traditional mode file attributes are facing a big challenge. It is very
hard to keep backward compatibility with all permission models. One problem
is that conversions from one model to another are complex. A bigger problem
arises when translating from a finer-grained to a less fine-grained model, for
example from NFSv4 to “POSIX” ACLs: in this case, the NFSv4 semantics
which are not supported by “POSIX” ACLs must be sacrificed [4]. Also, com-
puting the mode from a given ACL and making sure that both are synchronised
is time consuming and increases the complexity of the file system [44]. To avoid
all those issues, we have decided to support only native NFSv4.1 ACLs. We are
not planning to support “POSIX” ACLs or traditional mode style file attributes.
This means that users will need some time to get familiar with NFS ACLs. But
this is for their benefit.
It is time to present the actual file system calls that I ended up with. Let’s start
with some general concepts.
A file can be a plain file or a special file. Special files can be character files,
which are used for modelling serial I/O devices like terminals and printers, or
block files, which are used for modelling disks. Directories
are files that can contain other files inside them and are used to organise the
structure of a file system.
There are many different ways of implementing files and directories. Different
file systems follow different approaches. For example, in Btrfs a file is described
by a name and the i-node related with it. An i-node is the data structure used
for keeping track of the blocks that belong to a particular file [120, chap. 4].
ZFS uses a different data structure, called the dnode. A dnode is a collection
of blocks that make up a ZFS object [84, chap. 3]. Whether the file system
uses i-nodes, dnodes, or another data structure is not our concern, since we are
working on a higher level. What we should observe is that all file systems need
a way of keeping track of the blocks that belong to a particular file. It does
not matter if this is a plain file, a device file, or a directory (which is nothing
more than a list of files). Thus we can abstract all those different kinds of files
and their data structures in a single concept: the VObject. A VObject can be
used to model a regular file, a special file, a directory, etc. It is up to the file
system developer to take care of the implementation details regarding the data
structures related with a VObject.
Apart from modelling individual files, we also need a way of grouping those files
together. This is what a file system is: a group of files
sharing the same properties. Such properties can be the (1) owner of the file
system, which is the operating system user who has full privileges over it, (2)
file system type, which indicates whether it is ZFS, Btrfs, FAT32, etc., (3) root
directory, which is the parent directory of all the files that belong to the file
system, etc. In FenixOS, the abstraction which is used to describe a file system
is the VObjectSet. Examples of VObjectSets are a ZFS file system installed
on a hard disk, a FAT32 file system installed on a USB stick, etc.
Up to now we have defined the abstractions of the different file and file system
types. But this is not very far from what all VFS tend to do [63]. We should not
forget that we want an interface which supports transactions, and exposes them
to programmers in a simple way. Our next step is to define what a transaction
is in FenixOS. A VTransaction in FenixOS adds support for transactional se-
mantics to the file system call interface. An executed file system call is always
under the control of a VTransaction. Transactions control whole VObjectSets,
and operate on their VObjects. This makes sense, since a file (VObject) always
belongs to a file system (VObjectSet).
By observing Figure 5.2 we can see that the most important class of the file sys-
tem call interface is VTransaction. All the file system calls are inside it because
whatever happens during a system call must be protected by ACID. A significant
difference between our interface and DMU is that a single VTransaction can
operate on more than one VObjectSet. This means that a single transaction
can modify more than one file system. DMU supports only one object set per
transaction, but we believe that supporting multiple object sets per transaction is
important. A good use of this feature is atomically creating an index of the
contents of a ZFS file system installed on the hard disk and a FAT file system
installed on a USB stick. This can be done only if multiple VObjectSets per
VTransaction are supported.
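One possible way to express the relationships just described is sketched in the following C++ fragment. This is not the actual FenixOS interface: the member fields and the empty begin/commit/abort bodies are placeholders of my own that only mirror the textual description and Figure 5.2, where a transaction spans one or more object sets and mediates every change to their objects.

#include <string>
#include <vector>

// Placeholder identifier type; the real interface is only defined at the design level.
using ObjectId = unsigned long;

// Any kind of file: plain file, special (device) file, directory, ...
struct VObject {
    ObjectId id;
    std::string type;  // e.g. "File" or "Directory"
};

// A file system: a group of files sharing the same properties.
struct VObjectSet {
    std::string owner;          // user with full privileges over the file system
    std::string fsType;         // e.g. "ZFS", "Btrfs", "FAT32"
    std::string rootDirectory;  // parent directory of all files in the set
};

// Every file system call runs under the control of a transaction.
class VTransaction {
public:
    explicit VTransaction(VObjectSet& initial) { objectSets.push_back(&initial); }
    void addObjectSet(VObjectSet& set) { objectSets.push_back(&set); }

    void begin()  { /* start of the ACID-protected region */ }
    void commit() { /* make every modification of the transaction permanent */ }
    void abort()  { /* discard every modification made since begin() */ }

private:
    // A single transaction may span several object sets (file systems).
    std::vector<VObjectSet*> objectSets;
};

// Example use: one transaction covering two file systems at the same time.
void updateTwoFileSystems(VObjectSet& zfs, VObjectSet& fat) {
    VTransaction tx(zfs);
    tx.addObjectSet(fat);
    tx.begin();
    // ... create, open, rename, or remove VObjects of both sets here ...
    tx.commit();  // or tx.abort() to throw away everything since begin()
}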
Another thing that we can observe in Figure 5.2 is that only the VTransaction
can add VObjects to a VObjectSet. That is why there is no direct connec-
tion between a VObject and a VObjectSet. That also makes sense, since files
should be created/removed from a file system only using transactional seman-
tics. If files can be added/removed from a file system outside a transaction,
then transactions are useless.
Table 5.1 includes a short description of the file system calls, in the same order
as shown in Figure 5.2. Let me say a few words about each system call. In the
following descriptions I assume that the user who executes the system calls has
the permissions to perform the requested operations. What happens in reality is
that every system call tries to satisfy the requested service; only if the user
has the necessary permissions will the system call apply the requested changes.
Otherwise, an error number, message, etc. is returned to the user.
memoryMap maps a VObject into the address space of a process, and can be
used when a program wants to read from or write to a file. memoryUnmap does
the opposite work: it removes the VObject mapping from the process.
begin states that the transactional part of the program begins; only the code
enclosed between begin and commit is protected by ACID. commit states the end
of a transaction, which means that after commit all the file system modifications
of the transaction are permanent. Finally abort can be used by the programmer
at any time to exit a transaction immediately and ignore all the modifications
that happened inside it. An example of aborting a transaction is when you want
for instance to create a new file and edit it within the same transaction. If you
get a file system error because the file already exists (but did not exist before
your transaction) it means that most probably somebody else has created it
before you. In this case it is better to abort the transaction instead of trying to
modify the file.
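To make the abort scenario above more concrete, here is a small, self-contained C++ sketch of the pattern. FileSystemTx, createFile, and CreateResult are purely illustrative stand-ins that I introduce for the example; they are not the FenixOS system calls, which are only defined at the interface level in this thesis.

#include <iostream>

// Illustrative stand-ins for a transactional file system interface.
enum class CreateResult { Created, AlreadyExists };

struct FileSystemTx {
    void begin()  { std::cout << "transaction begins\n"; }
    void commit() { std::cout << "changes are now permanent\n"; }
    void abort()  { std::cout << "all changes inside the transaction are discarded\n"; }

    // Pretend that another process created the file after our transaction started.
    CreateResult createFile(const char* /*path*/) { return CreateResult::AlreadyExists; }
};

int main() {
    FileSystemTx tx;
    tx.begin();

    if (tx.createFile("/zfs/report") == CreateResult::AlreadyExists) {
        // The file did not exist before the transaction, so most probably
        // somebody else created it first: abort instead of modifying it.
        tx.abort();
        return 1;
    }

    // ... edit the newly created file within the same transaction ...
    tx.commit();
    return 0;
}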
At this point you might argue that a UML class diagram and a description of
the system calls is not enough for getting the whole picture. For this reason
I have decided to include a small and simplified example which shows how a
programmer can use the file system interface of FenixOS. The example will not
work if you try it, since my job was to define and document the system calls,
not to implement them.
Figure 5.3 shows a typical hierarchical file system structure of UNIX-like sys-
tems. The parent directory of all, called the root directory is represented
as /. Two different file systems reside in the root directory. One is a ZFS
VObjectSet, installed on the hard disk and located in /zfs. The other is a
FAT32 VObjectSet, installed on a USB stick and located in /media/stick.
In both file systems, there are different kinds of files. For example in the ZFS
VObjectSet there is a VObject called “bob” which is a plain file, and a VObject
called “work” which is a directory. What we want to do is to use our interface
for renaming the “bob” VObject to “alice”, and the “alice” VObject to “bob”
atomically. If one of the rename operations fails, the other should not take
place. This scenario might sound useless to you, but it is very similar to the
scenario of adding a new system user. Think of it for a moment. The add user
scenario says “I want to add this line in both /etc/passwd and /etc/shadow”.
Either both files must be modified or none of them. The only difference is that
in my scenario the two files reside in two different file systems, while in the add
user scenario both files are located in the same file system.
The example of Listing 5.1 begins with the assumption that the two VObjectSets
shown in Figure 5.3 exist and are called zfs and fat. Initially the two object
sets which will be modified are added to a single VTransaction. The code con-
tinues by defining the on-disk path of the two objects (in this case plain files)
which are also assumed to exist and will be modified. The next step is to
define the permissions that will be applied when trying to open the objects for
processing. To get more details about the NFSv4.x permissions please consult
[106]. The next part of the code deals with informing the transaction about
the specific objects which are going to be modified. Since a transaction controls
whole object sets, it is important to inform it about the specific object sets
that you are planning to modify. This is important because we are following
the semantics of the DMU [1] which requires from the programmers to state
explicitly which objects they intend to create or modify. The next system call,
begin, states that the transactional part of the code actually begins. After
that, the two objects are opened for processing. openObject returns the object
identifiers of the objects. After opening the objects, the actual rename takes
place. Normally the programmer would continue to work with the object sets
and their objects, but for the needs of this simple example we are done. The
objects are closed so that the object identifiers can be reused by other objects,
and the transaction is committed, which means that the changes will be applied
on the hard disk and the USB disk, respectively.
Figure 5.2: Compact class diagram of the FenixOS file system call interface. A
transaction uses at least one object set which acts as the file system.
Objects (plain files, directories, etc.) are added/removed from the used
object set(s) under the control of a transaction
Figure 5.3: The directory structure of the example. Under the root directory /, there
are two object sets: A ZFS object set is located in /zfs, and a FAT object
set is located in /media/stick. Two files of the object sets, located in
/zfs/bob and /media/stick/alice must be renamed atomically
Listing 5.1: Example of renaming two files atomically using a single transaction
// create a transaction and add two object sets
VTransaction tx(zfs);
tx.addObjectSet(fat);
// define the on-disk path of the objects that will be modified
const char zfsPath[] = "/zfs/bob";
const char fatPath[] = "/media/stick/alice";
// define the object permissions to try applying on open
mask = ACE4_READ_DATA | ACE4_WRITE_DATA | ACE4_READ_NAMED_ATTRS |
       ACE4_WRITE_NAMED_ATTRS | ACE4_EXECUTE | ACE4_DELETE;
// inform the transaction about the modification of the objects
tx.prepareForChanges(zfs, zfsPath);
tx.prepareForChanges(fat, fatPath);
// the transactional part starts now
tx.begin();
// open the objects for processing
zfsId = tx.openObject(zfs, File, zfsPath, mask);
fatId = tx.openObject(fat, File, fatPath, mask);
// rename the two objects
tx.renameObject(zfs, zfsId, "bob", "alice");
tx.renameObject(fat, fatId, "alice", "bob");
// continue working ...
// close the objects when done
tx.closeObject(zfsId);
tx.closeObject(fatId);
// commit the transaction
tx.commit();
Chapter 6
Development process, time
plan, and risk analysis
In this chapter I first talk about the development process that I have followed
during my thesis. I then show a time plan with the tasks I have completed and
give some details about each task. The chapter ends with an analysis of the
risks and a description of the challenges that I have faced during the thesis.
During my thesis I have focused on (1) dividing the project into
small iterations, (2) communicating with my supervisor on a weekly basis, (3)
doing a risk analysis, (4) refactoring, and (5) respecting the coding standards.
Table 6.1 shows the time plan of my thesis. At each iteration, the goal was to
plan accurately only the next two tasks and provide only a rough estimation for
the rest. This means that most tasks were overestimated or underestimated [5].
Let me say a few words about each task. In task “Operating system reliability,
structure, and performance study” I read papers and articles related with the
reliability, structure, and performance of operating systems [23, 49, 52, 53, 54,
95, 118, 119, 121]. The next task, “Legacy file system study”, included reading
textbook chapters and a paper related with the structure and organisation of
the most common file systems [78], [115, chap. 12], [120, chap. 4]. In task
“Btrfs design study” I tried to gather information about the design of the Btrfs
file system [9, 31, 64, 77, 80, 102]. After this task I made a presentation about
the Btrfs design to the rest of the FenixOS developers, to give them an idea about
it. In task “ZFS on-disk format and design study” I studied the design of ZFS
[2, 17, 18, 19, 20, 21, 35, 51, 70, 83, 84, 117, 132]. After learning some things
about both file systems, I prepared a document describing the major similarities
and differences between ZFS and Btrfs. Most if not all of them are discussed
in chapter “Overview, performance analysis, and scalability test of ZFS and
Btrfs”.
At this point my supervisor and I agreed that our next step would be a
performance analysis and scalability test between ZFS and Btrfs. However, there
was a problem: we had to wait for the IT support to take care of our ticket
request regarding the preparation of the required hardware for doing the tests.
This took a long time, and it was going to last even longer. Thankfully, a
system administrator kindly provided us with the hardware in the end, faster
than expected. Anyway, I tried to use my time efficiently instead of just waiting
for the hardware. My first idea was to cooperate with two students who were
doing their Bachelor thesis. Their thesis involved a reimplementation of the ZFS
stack for FenixOS. But since this is a lot of work, they were focusing on the read
parts. Everything related to writes was left as future work. So the idea was to look
into the missing parts. This was a bad idea actually, since this is my first project
related to operating system programming. Understanding the heavily undocumented
C code of ZFS proved not to be a good start. At least I revised the code of the
students, which involved code writing, bug fixing, and documenting. During
this task I found [26] very useful in understanding the on-disk format of ZFS.
After the end of this task I was left without a coding related topic to work on.
While searching for a relevant topic, I decided to read some more material about
COW and file systems in general [37, 43, 47, 55, 57, 61, 72, 93, 105, 127, 131]. I
proposed some ideas to my supervisor, but he came up with a better one:
The design of “A file system call interface for FenixOS”. You can find the details
about my outcomes in the relevant chapter.
Just as I was ready to start reading about transaction oriented file system call
interfaces and modern file permissions, the hardware for doing the performance
tests became available. Before doing the actual tests, I studied how to approach
benchmarking, which frameworks are recommended, and how they work.
This is what task “Proper benchmarking study and framework selection” was
about [12, 14, 56, 107, 123, 124, 125]. The next task, “ZFS and Btrfs perfor-
mance analysis and scalability test” involved executing the tests, collecting the
results, and evaluating them. After completing the tests, it was time to start the
second part of the thesis. Task “Transactional semantics and modern file permis-
sions study” involved reading papers about transactional memory, transactional
semantics, and NFS permissions [8, 10, 11, 74, 91, 96, 106, 114, 126, 130]. After
reading some additional papers and book chapters about virtual file systems
and memory mapping [22, chap. 12], [63, 75], [79, chap. 5], [104, 113], I came
up with a UML class diagram of a transaction oriented file system call inter-
face - “First draft of a transactional VFS”. It turned out that the diagram was
more complex than it needed to be. The problem was that I included too many
ideas. While some of them are important for users, they are not really important
for FenixOS at the moment. Examples include arbitrary object attributes [47,
chap. 5] and smart pointers [3]. After having a discussion with my supervisor,
we decided that I should focus only on the most important things: transactions
and file permissions.
Task “FenixOS file system call interface design” involved studying more care-
fully the semantics of DMU [84, chap. 3], [132], simplifying the UML diagram,
defining the system calls presented in chapter “A file system call interface for
FenixOS”, and making sure that the interface is intuitive and straightforward to
use. The final task, “Thesis report and feedback corrections” included writing
the report you are reading, and revising it after getting the necessary feedback.
At the same time, I tried to polish the system calls.
The first thing that I should note is that I am using Agile Risk Management for
the needs of my thesis, because it fits well with AUP [81]. In Figure 15.2 of [36,
chap. 15] you can see the potential risks of a new computer science/information
technology project. Not all of these risks are related with my thesis. For example
“Unclear payment schedule” is irrelevant, since my thesis is not funded. A
description of the identified risks follows.
From the commercial risks, an ill-defined scope can cause problems and create
misunderstandings. We made sure that the scope of the thesis was clearly
defined.
The visible planning and resource risks were that I personally lacked key skills,
since I had not participated in any operating system programming projects in
the past. To minimise this risk we made sure that I had the proper guidance
during the prototype implementation part of the thesis. We also ensured that
the topic was not unrealistic and too ambitious, because there was a limited
amount of time for working on it.
All the risks described above are too abstract and in the rest of this section I
will try to be more concrete and identify the specific risks which are directly
related with my thesis.
Due to financial and house contract issues, I had to go back to my home country
in the middle of July and continue working on my thesis remotely. My
supervisor said that we could do this if we were careful with the communication
issues. The problem was that he had not tried it in the past. Therefore, this
case involved some risks: communication issues, remote access to computers,
and technical assistance. I think that everything worked well, if we exclude the
communication part.
It took some time until I could find a programming related topic to work on.
I finally found one in the middle of June. The first problem was that the topic
should involve some research; it could not be implementation related only. This
is a problem because, even though a lot of research is done on file systems, most
missing features of the COW based file systems on which I am focusing involve
only implementation.
The fact that I am not an experienced C++ systems programmer in itself involved
many risks. I am interested in operating system design and development. This
is the main reason why I approached my supervisor and asked him if we could
find a thesis topic, given that I had no previous operating system
programming experience. However, it seems that an MSc thesis assumes that a
student is already an expert in the topic he/she selects, and that the result of
the thesis should be a contribution to the research community. This is definitely
not my case. I am familiar with C and C++, but I have not been involved in
kernel development or file system development in the past.
I spent some time doing things that were not directly related to my
thesis. In the beginning I thought that I would have to port ZFS or Btrfs to
FenixOS, but this is not considered advanced enough to be presented as the
work of an MSc thesis. Then comes the story about adding write support to the
ZFS reimplementation, which I have already explained.
I have also wasted some time running irrelevant benchmarks. The problem is
that ZFS has metadata compression enabled by default, and deactivating it is
not suggested. When I first started running the benchmarks for Btrfs, I enabled
compression, assuming that it is metadata compression. I later discovered that
Btrfs does not support metadata compression at all. It was the data that I was
compressing. And compressed data outputs cannot be compared between ZFS
and Btrfs, because they do not support the same compression algorithm. Thus,
I had to get the Btrfs results again with compression disabled, and live with the
fact that metadata will be compressed only in ZFS.
I had the wrong impression that file systems are a good starting point to ap-
proach operating systems, because they usually belong to a higher level compared
to the rest of the operating system: process management, memory man-
agement, scheduling, device drivers, etc. The problem is that modern file system
code is very complex, because it is trying to solve hard problems, such as relia-
bility, recoverability, easy administration, and scalability, but at the same time
needs to perform extremely well and fast.
I had to attend a 3-week course which started in the beginning of June and
lasted until the end of June. The course kept me busy from Monday to Friday,
9:00 - 17:00. I tried to do some work for my thesis in the remaining hours, but
my productivity was obviously reduced until the end of the course.
Chapter 7
Related work
This chapter is a discussion about the work of other people focusing on (1)
analysing the performance and testing the scalability of ZFS and Btrfs, (2)
designing transaction oriented system call interfaces.
There is a lot of work available which compares Btrfs against competing file
systems of GNU/Linux [67, 68] or presents individual file system results [97].
But at the time I decided to do my tests I could only find two comparisons
between Btrfs and ZFS [50, 88]. I cannot say a lot about [88] since there is no
analysis and evaluation of the results. As I have noted in a previous chapter, I
believe that ZFS is running in user space in [88], which is why it has very low
performance. In [50] ZFS outperforms Btrfs in all tests but sequential write.
However, my results have shown that Btrfs has evolved since that time. The
micro-benchmark results have shown that Btrfs outperforms ZFS in sequential
writes and rewrites, and performs equally to ZFS in sequential reads. ZFS is
better than Btrfs in metadata creation. All this applies to single disk
configurations. On mirroring, ZFS outperforms Btrfs in all tests but sequential
output.
Recently more ZFS and Btrfs comparisons have been published [65, 66]. Even
if the Phoronix results are not the most reliable, since people often criticise the
company for publishing black box results without trying to understand them,
we can make some observations by analysing them. In [66], the low performance
of ZFS when using FUSE is confirmed. What we can also observe is that
in some cases COW based file systems can compete equally with or even
outperform popular journaling file systems, like ext4. This means that even if
COW based file systems are not the best option to use on a database server
(pgbench test), they are ready to serve our everyday needs and offer us all their
advanced features at a small performance cost. In my opinion, it does not make
sense to use a COW based file system on databases since all good database man-
agement systems offer ACID semantics and snapshot support using journaling
at good performance. By observing the results of [65] (the increased thread
count results), we can also see that COW based file systems scale better than
their competitors.
One of the most interesting system call interfaces is TxOS [96]. TxOS adds
support for system transactions on a GNU/Linux system. This means that the
kernel data structures and memory buffers are modified to support transactions.
Like FenixOS, TxOS uses a strong atomicity model for serialising transactional
and non-transactional code [10]. What I like most about TxOS is its simplicity.
Programmers just need to enclose the code that they want to be transactional
between the sys_xbegin and sys_xend system calls. That is all. Wrap the
code that adds a new user between these two system calls and the code will be
protected by ACID. There is no way that a system crash will leave /etc/passwd
or /etc/shadow in an inconsistent state. Either both files will be updated or
none of them. I tried to keep the same simplicity when defining the interface
of FenixOS. The only difference is that a programmer has to use the various
prepareFor system calls before executing begin. But this is the way DMU
works. Other attractive features of TxOS are the usage of COW for supporting
ACID and its easy integration with STM and HTM to provide protection of an
application’s state.
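As a rough illustration of how small the TxOS programming model is, the add-user scenario could be sketched as follows. The names sys_xbegin and sys_xend come from [96], but the stub definitions, the prototypes, and the file handling around them are my own assumptions made to keep the example self-contained; this is not the real TxOS code.

#include <cstdio>

// Stubs so that the sketch compiles on its own; in TxOS these are real system
// calls [96], and only then does the atomicity guarantee described below hold.
static int sys_xbegin() { return 0; }
static int sys_xend()   { return 0; }

// Append one line to /etc/passwd and one to /etc/shadow (requires privileges).
// Under TxOS the two appends are part of one system transaction: either both
// become visible or neither does, even if the system crashes in between.
void add_user(const char* passwd_line, const char* shadow_line) {
    sys_xbegin();  // from here on, the system calls below run inside a transaction

    if (FILE* f = std::fopen("/etc/passwd", "a")) {
        std::fputs(passwd_line, f);
        std::fclose(f);
    }
    if (FILE* f = std::fopen("/etc/shadow", "a")) {
        std::fputs(shadow_line, f);
        std::fclose(f);
    }

    sys_xend();    // commit the transaction
}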
Another effort for providing ACID semantics to the file system is Amino [130].
Amino offers transaction support to user level applications, without changing
the POSIX interface of GNU/Linux. Amino is built on top of BDB. It uses
ptrace to service file system calls and stores all data in a b-tree schema of BDB.
Similar to TxOS, Amino simplifies the usage of transactions by providing the
basic begin, abort, and commit system calls. As the authors of Amino admit,
the biggest pain of supporting transactions in user space is the high overheads
that they generate, especially when data-intensive workloads are used. Such
overheads are not acceptable and result in low system performance [114].
A more recent effort of the Amino developers is Valor [114]. Valor is a transac-
tional file interface built as a GNU/Linux kernel module. Valor is designed to
run on top of another file system, for example ext3. A fundamental difference
between Valor and TxOS is that Valor does not use COW. It uses logging
instead, with a separate log partition keeping a record of all the transactional
activities. Beware that logging is not the same as journaling. While journaling
can protect only metadata and supports transactions which are finite in size and
duration, logging provides support for very long transactions. COW also has the
same problem as journaling regarding transaction length: while COW can
protect both data and metadata, by default it supports transactions provided
that they can fit into the main memory of the computer. Logging does not
have such a limitation, since the log is kept on the hard disk. Researchers argue
that the transaction length limitation of COW can be solved by swapping the
uncommitted transaction state to the hard disk or any other secondary storage,
but I have not found any implementations which solve this problem yet [96].
Nevertheless, I believe that the benefits of using COW instead of logging
outweigh the disadvantages. For instance, logging has the same disadvantage
as journaling: all changes need to be written twice, once to the log
and once to the actual file system. Write ordering is used to ensure that the
changes are first written to the log and then to the file system.
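The write-ordering idea can be illustrated with a minimal sketch, assuming a single append-only log file and ignoring everything else a real implementation needs (log records, checksums, recovery). This is a generic illustration of the technique, not Valor's actual code.

#include <cstdio>

// Minimal sketch of write ordering with a log: the change is recorded in the
// log and flushed before it is applied to the actual file.
void loggedAppend(const char* logPath, const char* filePath, const char* data) {
    if (FILE* log = std::fopen(logPath, "a")) {
        std::fputs(data, log);   // 1. record the intended change in the log
        std::fflush(log);        // 2. flush it (a real system would also fsync
        std::fclose(log);        //    the log to reach stable storage)
    }
    if (FILE* file = std::fopen(filePath, "a")) {
        std::fputs(data, file);  // 3. only now apply the change to the file itself
        std::fclose(file);
    }
}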
The thing that I like least about Valor is the complexity of using it.
Valor has seven system calls that are tightly connected with its architecture. A
programmer needs to understand the system calls and ensure that they are all
used in the right order. This is by far more complex than using simple begin,
abort, and commit statements. A final thing to note is that Valor can be used
only on file systems. It cannot be used as an interface for offering transactions at
the operating system level. This is unfortunate, since transactions can be useful
not only for file systems, but also for other system parts: processes, memory,
networking, signals, etc.
Chapter 8
Conclusions
This is the final chapter of my thesis. It summarises the contents of the thesis,
outlines my contributions, and proposes future work.
8.1 Conclusions
Modern hard disks are not perfect. According to disk manufacturers, it is ex-
pected that there will be one uncorrectable error every 10 to 20 TB of capacity.
Disk capacities in terms of thousands of TB are normal for today’s data centres.
Moreover, since disk capacity is relatively cheap, many personal computer users
have already started to use disks with capacities in terms of TB.
The majority of file systems and operating systems that are currently in use
cannot offer consistent system updates. Most modern operating systems are
using a journaling file system, which is unable to provide data integrity and
consistency. Journaling file systems can only protect metadata. Not protecting
data is not a design decision of the file system developers, since data are equally
or even more important than metadata; it is a limitation of journaling file systems.
Since we cannot rely on hardware (disks fail, power failures shut down systems,
etc.) or humans (software bugs crash operating systems, failed software upgrades
lead to broken systems, etc.), a better solution than journaling is needed.
The most recent solution is called COW. COW provides (meta)data consis-
tency using transactional semantics. Also, it can offer (meta)data integrity
using checksums at an acceptable overhead. Not only does COW offer (meta)data
consistency and integrity, but it also supports the creation of cheap snapshots
and clones. Users can take fast online backups without the need for any special
software technique like LVM, or expensive and complex commercial backup
software. Everything is taken care of by the file system.
ZFS and Btrfs are two modern free software COW based file systems. ZFS
has a very interesting architecture and design. It is the first file system which
supports many nice features that no other file system ever had. Some examples
are its pooled storage, its self-healing properties, and the usage of variable block
sizes. In contrast to Btrfs, ZFS uses full ACID semantics to offer transactional
support, and is stable enough to be used as the primary file system of OpenSo-
laris and optionally FreeBSD. Btrfs is trying to build everything around a single
concept: the COW friendly b+tree. Unlike ZFS, which is designed to be platform
independent, Btrfs reuses as many existing GNU/Linux components as
possible. The main goal of Btrfs is to provide users with all the nice features
of COW, while at the same time achieving better performance and scalability
than competing GNU/Linux file systems.
The performance analysis between ZFS and Btrfs has shown the strengths and
weaknesses of each file system. Because of its simplicity and tight integration
with GNU/Linux, Btrfs performs better than ZFS on single disk systems when
the caches are bypassed, and as soon as it becomes stable, it seems to be the right
choice for all but metadata creation intensive applications on single disks. Btrfs
should also be the preferred choice for NFS file servers and applications which
rely on creating, reading, and randomly appending files. Note that all these
apply only when the caches are bypassed. When the caches are utilised, ZFS
seems to outperform Btrfs in most cases, but I cannot draw firm conclusions
since in my micro-benchmark tests I bypassed the caches. ZFS is the right choice
for both single disks and mirroring for applications which use exclusively 64 KB
buffering. I believe that when the caches are utilised, ZFS can outperform Btrfs
no matter the size of the buffers. ZFS is also more scalable than Btrfs, which
means that it is more appropriate to use it on mail servers, database servers,
and applications that require synchronous semantics (fsync). Since FenixOS
focuses on fault tolerance and scalability, ZFS is the best choice for FenixOS at
the moment.
With the transition to multi-core and multiple processor systems, offering in-
tuitive concurrent programming models becomes a must. Transactions can be
extremely useful in concurrent programming. Only a few examples of using
transactions in operating systems are eliminating security vulnerabilities, en-
suring file system consistency, and providing an alternative to lock-based
programming approaches. Not only should modern operating systems and file
systems support transactions, but they must also expose them to programmers
as a simple API. We use system transactions since no special hardware or
software is required for supporting them. Furthermore, system transactions can
be isolated, rolled back, and execute multiple system calls without violating
transactional semantics.
Most UNIX-like operating systems are POSIX compatible. But we believe that
the time has come to define a modern transaction oriented, object based system
call interface. My work focuses on the file system call interface of FenixOS.
I have discussed the most important design decisions of our interface. Our
major decisions include using (1) COW to provide transactional semantics, (2)
NFSv4.1 ACLs as the only file permission model, (3) memory mapped resources
for replacing the legacy read/write system calls. I have also presented the file
system calls that I have defined and gave an example of how they can be used
in practice.
Since Btrfs is under heavy development and new features are still added in
ZFS, it is important to continue analysing their performance and testing their
scalability. It will be interesting to see the updated results of [50]. The author
informed me that he is planning to update his results at the end of 2010. Also,
I would like to see if all the results of bonnie-64 and Bonnie++ are consistent
when using the same file size, since there is a difference in the effective seek
rates when the size is changed from 12 GB to 16 GB. I had the chance to test
the two file systems only on single disks and RAID 1, but since they support
more RAID configurations, it is important to see how they perform in those
configurations as well. Since Filebench can emulate more applications, such as
web servers and web proxies, it would also be worthwhile to see the results of
such tests. In the workload
generator test, I used IOzone with a fixed record size of 64 KB and concluded
that the variable block size mechanism of ZFS worked better than the default 4
KB block size of Btrfs. It would also be interesting to see what happens if the
block size of Btrfs is set to 64 KB and a variable IOzone record size is used.
I have documented the design goals and defined the file system calls of the
FenixOS interface, but unfortunately there was no time to start implementing
it. It would be an interesting exercise to see how COW, transactions, NFS 4.x
ACLs, and memory mapped resources are implemented. Since the interface is
minimal, it is likely that I have omitted some necessary system calls. I believe
that it will not be hard for the developers who will implement the first file system
for FenixOS to add any required system calls. Finally, my design is deliberately relaxed.
For example, apart from an id and a type, a VObject does not contain any
other information. It might be necessary to add more details about VObjects,
VObjectSets, etc. at the VFS level. Some examples are including the checksum
of a VObject and the owner of a VObjectSet.
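For illustration only, a VObject extended in the way suggested above could look
roughly like the following; the exact field names and types are assumptions and
not part of the current design.

// Sketch only: a minimal VObject (an id and a type) plus the kind of extra
// information that might be needed at the VFS level. All field names and
// types are assumptions.
#include <cstdint>
#include <vector>

enum class VObjectType { File, Directory, SymbolicLink };

struct VObject {
    std::uint64_t id;        // unique object identifier
    VObjectType   type;      // what kind of resource the object represents
    std::uint32_t checksum;  // possible addition: integrity checksum of the data
};

struct VObjectSet {
    std::vector<std::uint64_t> members;  // ids of the VObjects in the set
    std::uint32_t owner_uid;             // possible addition: the owner of the set
};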
Bibliography
[26] Bruning, M. ZFS On-Disk Data Walk (or: Where’s my Data?). http://
www.bruningsystems.com/osdevcon_draft3.pdf. Retrieved: 2010-09-
10.
[36] Cadle, J., and Yeates, D. Project Management for Information Sys-
tems. Pearson Prentice Hall, 2008.
[42] Dawidek, P. J. Porting the ZFS File System to the FreeBSD Operating
System. AsiaBSDCon 2007 (2007), 97–103.
[44] Falkner, S., and Week, L. NFSv4 ACLs: Where are we go-
ing? http://www.connectathon.org/talks06/falkner-week.pdf. Re-
trieved: 2010-09-06.
[47] Giampaolo, D. Practical File System Design with the Be File System.
Morgan Kaufmann, 1999.
[49] Härtig, H., Hohmuth, M., Liedtke, J., Wolter, J., and
Schönberg, S. The performance of µ-kernel-based systems. In ACM
Symposium on Operating Systems Principles (1997), pp. 66–77.
[74] Liu, Y., Zhang, X., Li, H., Li, M., and Qian, D. Hardware Transac-
tional Memory Supporting I/O Operations within Transactions. In Pro-
ceedings of the 2008 10th IEEE International Conference on High Perfor-
mance Computing and Communications (2008).
[91] Oracle. Oracle Berkeley DB. Getting Started with Transaction Process-
ing for C++. 11g Release 2. Oracle, 2010.
[95] Pike, R., Presotto, D., Dorward, S., Flandrena, B., Thompson,
K., Trickey, H., and Winterbottom, P. Plan 9 from Bell Labs.
http://plan9.bell-labs.com/sys/doc/9.html. Retrieved: 2010-08-14.
[96] Porter, D. E., Hofmann, O. S., Rossbach, C. J., Benn, A., and
Witchel, E. Operating System Transactions. In Proceedings of the
22nd ACM SIGOPS Symposium on Operating Systems Principles (2009),
pp. 161–176.
[104] Schandl, B., and Haslhofer, B. The Sile Model – A Semantic File
System Infrastructure for the Desktop. In Proceedings of the 6th European
Semantic Web Conference on The Semantic Web: Research and Applica-
tions (2009).
[105] Shaull, R., Shrira, L., and Hao, X. Skippy: A new snapshot indexing
method for time travel in the storage manager. In Proceedings of the
ACM SIGMOD International Conference on Management of Data (2008),
pp. 637–648.
[106] Shepler, S., Eisler, M., and Noveck, D. Network File System (NFS)
Version 4 Minor Version 1 Protocol. http://tools.ietf.org/search/
rfc5661. Retrieved: 2010-09-06.
[113] Song, Y., Choi, Y., Lee, H., Kim, D., and Park, D. Searchable
Virtual File System: Toward an Intelligent Ubiquitous Storage. Lecture
Notes in Computer Science 3947/2006 (2006), 395–404.
[114] Spillane, R., Gaikwad, S., Chinni, M., Zadok, E., and Wright,
C. P. Enabling transactional file access via lightweight kernel exten-
sions. In Proceedings of the 7th conference on File and storage technologies
(2009), pp. 29–42.
[117] Stanik, J. A conversation with Jeff Bonwick and Bill Moore. ACM
Queue 5 (2007), 13–19.
[118] Swift, M. M., Bershad, B. N., and Levy, H. M. Improving the relia-
bility of commodity operating systems. In ACM Symposium on Operating
Systems Principles (2003), pp. 207–222.
[119] Tanenbaum, A. S. Tanenbaum-Torvalds Debate: Part II. http://www.
cs.vu.nl/~ast/reliable-os/. Retrieved: 2010-08-13.
[120] Tanenbaum, A. S. Modern Operating Systems 3e. Pearson Prentice Hall,
2008.
[121] Tanenbaum, A. S., Herder, J. N., and Bos, H. Can We Make
Operating Systems Reliable and Secure? IEEE Computer 39 (2006), 44–51.
[122] Tanenbaum, A. S., and Woodhull, A. S. Operating Systems Design
and Implementation, Third Edition. Prentice Hall, 2006.
[123] Traeger, A., Joukov, N., Wright, C. P., and Zadok, E. The FSL-
er's Guide to File System and Storage Benchmarking. http://www.fsl.
cs.sunysb.edu/docs/fsbench/checklist.html. Retrieved: 2010-08-25.
[124] Traeger, A., Zadok, E., Joukov, N., and Wright, C. P. A nine
year study of file system and storage benchmarking. ACM Transactions
on Storage (TOS) 4 (2008), 5–56.
[125] Traeger, A., Zadok, E., Miller, E. L., and Long, D. D. Findings
from the First Annual Storage and File Systems Benchmarking Workshop.
In Storage and File Systems Benchmarking Workshop (2008), pp. 113–117.
[126] Volos, H., Tack, A. J., Goyal, N., Swift, M. M., and Welc, A.
xCalls: safe I/O in memory transactions. In Proceedings of the 4th ACM
European conference on Computer systems (2009).
[127] Wang, Z., Feng, D., Zhou, K., and Wang, F. PCOW: Pipelining-
based COW snapshot method to decrease first write penalty. Lecture
Notes in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics) (2008), 266–274.
[128] Wells, D. The Rules of Extreme Programming. http://www.
extremeprogramming.org/rules.html. Retrieved: 2010-09-09.
[129] Wireshark User's Guide: Checksums. http://www.wireshark.org/docs/wsug_html_chunked/
ChAdvChecksums.html. Retrieved: 2010-08-22.
[130] Wright, C. P., Spillane, R., Sivathanu, G., and Zadok, E. Ex-
tending ACID semantics to the file system. ACM Transactions on Storage
(TOS) 3 (2007), 1–42.
[131] Xiao, W., and Yang, Q. Can We Really Recover Data if Storage Sub-
system Fails? In 2008 28th IEEE International Conference on Distributed
Computing Systems (ICDCS) (2008), pp. 597–604.
[132] ZFS Source Tour. http://hub.opensolaris.org/bin/view/Community+
Group+zfs/source. Retrieved: 2010-08-24.