each storage device node to get its serial number. (You can also extract
serial numbers from camcontrol(8).) This will tell you that, say, disk /dev/da0 is actually disk 3 on shelf 4.
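For example, camcontrol(8) can list the disks the system sees and print a drive's serial number. The device name here is only a placeholder, and SATA disks may need camcontrol identify rather than inquiry:
# camcontrol devlist
# camcontrol inquiry da0 -S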
You now have a list of device nodes and their associated serial numbers,
as well as a list of physical locations and serial numbers. Match those up
with each other, and then use GPT labels to attach the location and serial
number to the disk partition you’re using. (See the FreeBSD documentation
or FreeBSD Mastery: Storage Essentials for details on GPT labels.) GPT
labels have a maximum length of 15 characters, so you might have to
truncate long serial numbers. In most serial numbers, the last digits are the most distinctive, so trim from the front.
Combined, disk 9 in shelf 2, with a serial number of WD-
WCAW36477223, might get a label like /dev/gpt/s2d9-AW36477223.
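Assuming that disk is da0 and its ZFS partition is index 1, you could attach such a label with gpart(8) along these lines, then display the partition table with labels to confirm it:
# gpart modify -i 1 -l s2d9-AW36477223 da0
# gpart show -l da0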
You want your system to use these labels, and only these labels. Disable
GPTID and disk ident labels on the system. This avoids confusion later.
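One way to do this is through the loader tunables for those GEOM label classes, set for example in /boot/loader.conf (check the defaults on your FreeBSD release):
kern.geom.label.gptid.enable="0"
kern.geom.label.disk_ident.enable="0"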
With this setup, when hardware fails, FreeBSD can tell you that the third disk on shelf 4, serial number such-and-such, is bad. Given
that information, even the most junior tech at your colocation provider
should be able to pull the right disk.1 Have the tech give you the serial
number of the replacement drive before installation, so you can create the
proper labels.
Advance planning makes outages much less traumatic. We highly
recommend it.
About this Book
This book is for anyone who manages ZFS filesystems or who is curious
about what a modern, high-performance filesystem looks like. While it
focuses on ZFS on FreeBSD, the general ZFS information applies to any
platform running OpenZFS. Parts of this book happen to apply to other implementations, such as Oracle ZFS, but you can't assume that everything here carries over to them.
We really wanted to write a single FreeBSD OpenZFS book, but
limitations in the chosen publishing platforms made that impractical.
FreeBSD Mastery: ZFS covers routine use of ZFS. The next book,
FreeBSD Mastery: Advanced ZFS, covers online replication, performance
tuning, and other topics requiring greater understanding of ZFS. The
second book assumes you understand everything in this book, however.
OpenZFS advances constantly. This book is a static entity. What’s
more, a book that covered every OpenZFS feature would be the size of the
print version of the Manhattan telephone book.2 These books try to offer
what the vast majority of sysadmins must know to run ZFS well. If you’re
looking for a feature we don’t discuss, or you have a special edge case we
don’t cover, definitely check the man pages, the online OpenZFS
documentation, and the FreeBSD mailing lists archives and forums.
Book Overview
Chapter 0 is this introduction.
Chapter 1, Introducing ZFS, gives you a pterodon’s-eye view of the ZFS
software suite. You’ll learn how to look at ZFS filesystems and data pools,
and understand how the large chunks of ZFS fit together.
Chapter 2, Virtual Devices, takes you through ZFS' physical
redundancy schemes. ZFS supports traditional mirrored disks and
concatenated devices, but also offers its own advanced parity-based
redundancy, RAID-Z.
Chapter 3, Pools, discusses ZFS storage pools. You’ll learn how to
assemble virtual devices into pools, how to check pools, and how to
manage and query your storage pools.
Chapter 4, Datasets, takes you through what traditionalists would call a
filesystem. Except in ZFS, it’s not really a filesystem. Yes, you put files in
a dataset, but a dataset is so much more.
Chapter 5, Pool Repairs and Renovations, covers making changes to
storage pools. You can expand storage pools with additional disks, repair
failed disks, and tweak pools to support new features.
Chapter 6, Disk Space Management, covers one of the most
misunderstood parts of using ZFS. Why does your 1 TB drive claim to
have 87 TB free? How do you reserve space for some users and limit
others? What about this deduplication stuff? This chapter covers all that
and more.
Chapter 7, Snapshots and Clones, discusses ZFS’ snapshot feature. You
can create a point-in-time photograph of a dataset, and refer back to it later.
You want a copy of a file as it existed yesterday? Snapshots are your
friends. Similarly, clones let you duplicate a filesystem. You’ll understand
both.
Chapter 8, Installing to ZFS, covers installing FreeBSD to ZFS. The
FreeBSD installer can install a ZFS-based system for you. The installer is
always improving, but the real world is more complex than any installation
program can possibly expect. Knowing how to install the system exactly
the way you want is useful.
Fasten your seat belt and get ready to dive into a filesystem for the 21st
century.
1 He will probably screw it up, because that’s what junior techs do. But give the poor guy a
shot.
2 See, once upon a time the phone company printed huge books that listed everyone with a
phone and their phone number. No, phone numbers didn’t change so often, because they
were all landlines. But then the dinosaurs knocked the phone lines down, so we went
cellular.
Chapter 1: Introducing ZFS
Starting to learn ZFS isn’t hard. Install a recent FreeBSD release. Tell the
installer you want ZFS. You’ve started. If you’ve never worked with ZFS,
take a moment and install a new FreeBSD with ZFS on a test system or
virtual machine. Don’t choose encryption or any of the fancy customization
options. This trivial install offers an opportunity to look at some ZFS basics
before diving into more complicated setups.
ZFS combines the functions of traditional filesystems and volume
managers. As such, it expects to handle everything from the permissions on
individual files and which files are in which directories down to tracking
which storage devices get used for what purposes and how that storage is
arranged. The sysadmin instructs ZFS in arranging disks and files, but ZFS
manages the whole storage stack beneath them. This chapter separates the
ZFS stack into three layers: filesystems, storage pools, and virtual devices,
using a FreeBSD 10.1 host installed with the default ZFS settings.
To orient you, we start at the most visible parts of the storage stack and
work our way down. Once you understand how the layers fit together, the
rest of this book starts at the foundation and works its way up.
ZFS Datasets
ZFS filesystems aren’t exactly analogous to traditional filesystems, and so
are called datasets. The classic Unix File System (UFS) and its derivatives
and work-alikes, such as modern BSD’s UFS2 and Linux’s extfs, manage
filesystems with a variety of programs. You’re probably well accustomed
to using df(1), newfs(8), mount(8), umount(8), dump(8), restore(8), and
similar commands. ZFS absorbs all of these functions in the zfs(8)
program, which lets you create, destroy, view, and otherwise spindle ZFS
datasets.
Start by viewing existing ZFS datasets with zfs list.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zroot 429M 13.0G 96K none
zroot/ROOT 428M 13.0G 96K none
zroot/ROOT/default 428M 13.0G 428M /
zroot/tmp 104K 13.0G 104K /tmp
zroot/usr 428K 13.0G 96K /usr
…
This combines the output of mount(8) and df(1), and should look
pretty familiar to anyone who’s managed UFS or extfs.
Each dataset has a name. A ZFS dataset name starts with the ZFS
storage pool, or zpool, the dataset is on. Our first entry is called just plain
zroot. This entry represents the pool’s root dataset, which everything else
hangs off of.
The next two columns show the amount of space used and available. The pool zroot has used 429 MB and has 13 GB free.
The REFER column is special to ZFS. This is the amount of accessible
data on the dataset, which is not necessarily the same as the amount of
space used. Some ZFS features, such as snapshots, share data between
themselves. Our zroot entry has “used” 429 MB, but only refers to 96 KB
of data. The pool as a whole has 13 GB free, but 96 KB are accessible
through this specific dataset. That’s not much. The rest of the space is used
for children of this dataset. Chapter 6 gives a detailed discussion of ZFS
disk usage. A dataset’s children include snapshots, volumes, and child
datasets, as you’ll see throughout this book.
Finally we have the filesystem mount point. The zroot ZFS is not
mounted.
Look at the second entry, named zroot/ROOT. This is a ZFS dataset
created for the root filesystem. Like the zroot pool, it isn’t mounted. It
refers to 96 KB of data. This apparently isn’t used, which seems strange for a
root filesystem.
The third entry, zroot/ROOT/default, is the current root filesystem. It
uses 428 MB of data, and is mounted on /, the Unix root. It refers to 428
MB, meaning that there’s that amount of data in this dataset.
Why would ZFS split this out from the root filesystem? ZFS makes it
easy to choose between multiple root filesystems. This host runs FreeBSD
10.1, but suppose you must apply some security updates and reboot?
Applying operating system patches always afflicts systems administrators
with a gut-twisting mix of fear and hope. Even a well-tested upgrade can
go wrong and ruin everyone’s day. But ZFS lets you clone and snapshot
datasets. When you upgrade to FreeBSD 10.1-p1, you could create a new
dataset such as zroot/ROOT/10.1-p1 and tell FreeBSD to use that as the
root partition. You either wouldn’t mount zroot/ROOT/default, or you’d mount it at an alternate location like /oldroot. If the upgrade goes poorly,
reversion is trivial.
The next dataset, zroot/tmp, is almost empty. It’s mounted at /tmp.
This dataset is the traditional temporary directory.
ZFS Partitions and Properties
ZFS lacks traditional partitions. A partition is a logical subdivision of a
disk, filling very specific Logical Block Addresses (LBAs) on a storage
device. Partitions have no awareness of the data on the partition. Changing
a partition means destroying and (presumably) rebuilding the filesystem on
top of it.
Lucas’ first thought on seeing a partition-less filesystem was to wonder how he would manage his storage at all. That’s roughly equivalent to the
confusion he experiences when, after a long cold Michigan winter, he steps
outside and feels natural warm air for the first time in months. Confusion is
part of liberation. We learned to administer storage via partitions because
we had to, not because partitions are pleasant or because they’re the best
solution. Running a traditional filesystem without partitions is poor
practice, but ZFS is not a traditional filesystem.
ZFS tightly integrates the filesystem and the lower storage layers. This
means it can dynamically divide storage space between the various
filesystems as needed. While you can set specific size limits on a ZFS
filesystem, datasets do not have traditional sizes. If the pool has enough
space for a file, you can use it. Where you previously allocated a limited
amount of disk space to, say, /var/log, and thus kept berserk logs from
filling your disk, you must now set those limits at the ZFS level.
The amount of space a dataset may use is one example of a ZFS
property. ZFS supports dozens of dataset properties—for example, the
quota property controls how large a dataset can grow. Use zfs(8) to set a
ZFS property.
View all of a dataset’s properties with zfs get all and the ZFS dataset
name.
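For example, to cap a hypothetical log dataset at 10 GB and then inspect the result (the dataset name is only an illustration):
# zfs set quota=10G zroot/var/log
# zfs get quota zroot/var/log
# zfs get all zroot/var/log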
Chapter 4 explores ZFS properties in detail, while Chapter 6 discusses
restricting dataset size.
ZFS Limits
Filesystems have always had maximum sizes and limits. The FAT
filesystem we all know and cringe over has required multiple revisions, in
part to overcome its maximum size of 32 MB, then 2 GB, then 4 GB.
FAT32’s 2 TB limit is starting to look a little cramped these days. UFS and
ext2/3/4fs have had their own, similarly arbitrary, limits. These limits exist
because the filesystem authors had to set a limit somewhere, and chose
values that they expected to be good for the next several years. A popular
filesystem will remain in use until those limits are reached, however, so
systems administrators have needed to repeatedly cope with them.
ZFS advocates claim that ZFS is immune to these arbitrary limits, but
that’s not quite true. ZFS uses 128 bits to store most of its values, which sets the limits so high that they won’t ever be encountered by anyone working in systems administration today. One directory can have 2^48 files, of up to 16 exabytes each. A single pool can be up to 256 zettabytes, or 2^78 bytes. A storage pool can contain up to 2^64 devices, and a single host can have up to 2^64 storage pools.
The good news is, we will not live long enough to hit these limits. The
bad news is, we have all the expertise in migrating between filesystems.
When technology hits ZFS’ limits, those poor people won’t be accustomed
to migrating between filesystems. Fortunately, they’ll have a few lingering
ongoing FAT/UFS/extfs rollovers for practice.
Storage Pools
ZFS uses storage pools rather than disks. A storage pool is an abstraction
atop the underlying storage providers, letting you separate the physical
medium and the user-visible filesystem on top of it.
Use zpool(8) to view and manage a system’s storage pools. Here’s the
pool from a default FreeBSD system.
# zpool status
pool: zroot
state: ONLINE
scan: none requested
config:

  NAME        STATE     READ WRITE CKSUM
  zroot       ONLINE       0     0     0
    gpt/zfs0  ONLINE       0     0     0
You get the pool’s name and state first. Systems can have more than one
ZFS pool—large systems, with dozens and dozens of hard drives, often
have multiple pools. If this host had multiple storage pools, each would
appear in a separate description like the sample above.
ZFS can perform many sorts of integrity checks on storage pools. The
scan statement shows if any integrity check is being performed and the
result of the most recent scan.
The last part of the pool list shows the layout of the virtual devices in
the pool.
Virtual Devices
A storage pool contains one or more virtual devices, or VDEVs. A VDEV is
similar to a traditional RAID device. A big RAID-5 presents itself to the
filesystem layer as a single huge device, even though the sysadmin knows
it’s really a whole bunch of smaller disks. Virtual devices let you assign
specific devices to specific roles. With VDEVs you can arrange the
physical storage as needed.
The virtual device is where a whole bunch of ZFS’ magic happens. A
pool can be arranged for RAID-style redundancy. You can use providers as
dedicated read and write caches, improving the virtual device’s
performance. Chapter 2 covers virtual devices in more depth.
ZFS’ data redundancy and automated error correction also take place at
the VDEV level. Everything in ZFS is checksummed for integrity
verification. If your pool has sufficient redundancy, ZFS is self-healing. If
your pool lacks redundancy, well, at least you know the data is damaged
and you can (hopefully) restore from backup.1
The zpool status command that displays the health of a pool also
shows the virtual devices in that pool. Look at the example in the previous
section. This very simple pool, zroot, contains a single storage provider, /
dev/gpt/zfs0. This provider is a GPT partition, not a disk. ZFS can use all
1 ZFS does not eliminate the need for backups. The only thing that eliminates backups is
absolute indifference.
Chapter 2: Virtual Devices
In this chapter we’ll delve into how the sausage is made. This... is a pig—I
mean, a disk. Disks are the physical manifestation of storage. Disks are
evil. They lie about their characteristics and layout, they hide errors, and
they fail in unexpected ways. ZFS means no longer having to fear that your
disks are secretly plotting against you. Yes, your disks are plotting against
you, but ZFS exposes their treachery and puts a stop to it.
To most effectively use the available disks with ZFS, you require a basic
understanding of how the operating system presents disks to ZFS, and how
ZFS arranges data on those disks.
Disks and Other Storage Media
ZFS can also run on storage media other than disks. Anything that is a
FreeBSD GEOM storage provider can become a ZFS storage medium. ZFS
even has support for using files as the backing storage, which is really great
for testing but is not meant for production. ZFS can use any block device
for its physical storage, but each type has its advantages and disadvantages.
Raw Disk Storage
Using an entire physical disk reduces complexity. Also, there is no
partitioning to worry about, and no software or configuration between ZFS
and the physical disk. However, the disadvantages usually outweigh these
advantages.
Booting from a disk requires that the disk have a boot loader. A boot
loader can only go on a partitioned disk. You cannot boot a raw disk.
FreeBSD supports giving disks useful labels, but those labels live inside the
partition information.
Worse, any replacement disks must be exactly the same size as the
original disk, or larger. Not all 6 TB disks are the same size—disks from
different vendors vary by a few megabytes. You don’t care about these
variances when setting up a system, but they’re critical when replacing a
disk. Most catalogs don’t list the number of sectors in each disk, only the
size, so finding a usable replacement can take several attempts. Replacing a
drive that uses the traditional 512-byte sectors with one that uses 4096-byte
(4K, also known as Advanced Format) sectors complicates things further.
The original drive probably had a number of sectors not evenly divisible by
8. Thanks to the special math used by disk drives, the new drive might
appear to be just a couple bytes smaller than the old drive even if it’s a
couple bytes larger.
Partition Storage
Instead of using an entire raw disk, you can partition a disk and then use
one of the partitions for ZFS. The biggest advantage to this is that you can
now boot from the disk that contains the ZFS partition, by creating a small
boot partition, instead of requiring a separate boot device. Using partitions
also allows you to use part of the disk space for other things, like a raw
swap partition, another filesystem, or just leaving some wiggle room at the
end of the disk so the replacement disk doesn’t have to have a matching
sector count. Partitioning also allows you to “short stroke” the drive to
increase performance.
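A minimal GPT layout for a bootable ZFS disk might look something like the following sketch. The device name, sizes, and labels are placeholders; FreeBSD Mastery: Storage Essentials covers the details.
# gpart create -s gpt da0
# gpart add -t freebsd-boot -s 512k da0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
# gpart add -t freebsd-swap -a 1m -s 4g -l swap0 da0
# gpart add -t freebsd-zfs -a 1m -l zfs0 da0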
Many of the original Solaris ZFS administration guides recommend
against using partitions (or, in Solaris terms, slices) for performance
reasons. In Solaris, using a partition for a filesystem disables the write
cache. In FreeBSD, disabling the write cache is completely separate from
disk partitioning or filesystems. FreeBSD gives full performance when
using ZFS on a partition.
FreeBSD supports a number of partitioning schemes, but GPT is
strongly recommended. The older partitioning system, MBR, limited the
number of primary partitions to four, while GPT supports up to 128
partitions. MBR can manage disks up to only 2 TB, while GPT can manage
up to 8 ZB with 512-byte-sector disks and up to 64 ZB with 4K-sector
disks. FreeBSD Mastery: Storage Essentials covers FreeBSD’s support for
both partitioning methods.1
The disadvantage to using partitions is that you might lose some of the
portability that ZFS provides. If you move disks from one system to
another, the target system must be able to recognize the disk partitions.
GEOM Device Storage
ZFS can also use the various FreeBSD GEOM classes as its backing
storage. These sit between the filesystem and the physical devices, and
perform various functions. The GEOM classes provide features such as
whole disk encryption (GELI, GBDE), high availability, labels, multipath,
and pluggable schedulers. A GEOM class can be created based on an entire
device, or on top of another GEOM class, such as a partition, multipath
device, or encrypted disk.
GELI (the FreeBSD disk encryption subsystem) is the best way to
achieve an encrypted ZFS pool. GELI encrypts and decrypts blocks as they
are passed back and forth between ZFS and the physical disks, so it doesn’t
require ZFS to do anything different. GELI supports a number of different
encryption algorithms, but the default AES-XTS offers the best
performance, especially with a modern CPU that supports the AES New
Instructions (AES‑NI). With the help of this hardware offload feature,
GELI can encrypt data at over 1 GB/sec and decrypt even faster, meaning
that adding encryption will not lower your performance, even on an SSD.
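A bare-bones encrypted pool might be assembled along these lines; the provider, sector size, and pool name are placeholders, and geli init prompts for a passphrase:
# geli init -s 4096 /dev/gpt/zfs0
# geli attach /dev/gpt/zfs0
# zpool create vault gpt/zfs0.eli
ZFS sees only the decrypted .eli device, while everything that reaches the physical disk is encrypted.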
GELI can also optionally provide data authentication (integrity
verification), where it stores a Hashed Message Authentication Code
(HMAC) with each sector. It uses this HMAC to verify the integrity (the
data has not been tampered with), and authenticity (this data was written by
you) of the data. If upon reading back the sector, the HMAC does not
verify the data, an error is returned. The HMAC feature is not enabled by
default, and is probably overkill for ZFS because ZFS provides its own
checksumming on each data block.
High Availability Storage Technology (HAST) is FreeBSD’s distributed
storage solution. It allows you to mirror a block device between computers
over the network. Using HAST as the backing storage for a ZFS pool
allows you to mirror each backing disk to a second machine. The advantage
to HAST is that it is real time; a block is not considered to be written until
it has been written to all hosts in the HAST cluster. ZFS replication, on the
other hand, is based on syncing periodic snapshots. However, with HAST
the second machine cannot have the pool imported or mounted at the same
time as the first machine. Compared to ZFS replication, where you can
have the replicated pool active (but read-only) concurrently, HAST makes
sense in only a few cases.
GEOM labels provide a handy way to attach a meaningful note to each
disk or partition. There are many label types, including standards like disk
ident, gptid, GPT labels, and the GEOM-specific glabel. Best practices for
labeling drives appear in Chapter 0.
GEOM also supports multipath for high availability. Sometimes it is not
just the disk that dies, but also the controller card, the backplane, or the
cable. With multipath, enterprise drives that are “dual ported” can be
connected to more than one HBA (a disk controller card without any RAID
features). If each drive has a path to two different storage controllers, it can
survive the loss of one of those controllers. However, when each disk is
connected to two different controllers, the operating system sees each disk
twice, once via each controller. The GEOM multipath class allows you to
write a label to each disk, so that successive routes to the same disk are
detected as such. This way you get one representation of each disk, backed
by multiple paths to that disk via different controllers. We discuss
multipath in FreeBSD Mastery: Advanced ZFS.
The GEOM scheduler module allows the administrator to specify
different I/O scheduling algorithms in an attempt to achieve better
performance. As of this writing, the currently available schedulers are “as,”
a simple form of anticipatory scheduling with only one queue, and “rr,”
anticipatory scheduling with round-robin service across each client queue.
See gsched(8) for more details. The GEOM system makes it relatively
easy to write additional scheduling modules for specific workloads.
File-Backed Storage
You can use a file-backed virtual disk as a ZFS storage device. While we
certainly don’t recommend this for production, file-backed disks can be
useful for testing and experimenting.
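For example, to experiment with a RAID-Z1 without touching real disks (the file names and sizes are arbitrary):
# truncate -s 1g /tmp/disk0 /tmp/disk1 /tmp/disk2
# zpool create playground raidz1 /tmp/disk0 /tmp/disk1 /tmp/disk2
Destroy the pool and delete the files when you’re done.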
Providers vs. Disks
“Provider” is a technical term in FreeBSD. A GEOM storage provider is a
thing that offers data storage. It might be a disk. It might be a GEOM class
that transforms the storage in some way. Technically speaking, this book
should use the word provider instead of disk almost everywhere. You can
use any GEOM provider as a back end for ZFS. The problem with this is,
one physical disk can offer several different providers. Your pool might
have several different providers, but if they’re all on one disk, you’ve just
shot your redundancy in the head.2
Where this book discusses “disks,” we mean “some sort of provider on
top of a disk.” This disk doesn’t have to be wholly dedicated to ZFS—you
could have a swap partition and a ZFS partition on a disk and be perfectly
fine. But you can’t have two ZFS partitions on a single physical disk,
mirror them, and have physical redundancy.
VDEVs: Virtual Devices
A virtual device, or VDEV, is the logical storage unit of ZFS. Each VDEV
is composed of one or more GEOM providers. ZFS supports several
different types of VDEV, which are differentiated by the type of
redundancy the VDEV offers. The common mirrored disk, where each disk
contains a copy of another disk, is one type of VDEV. Plain disks, with no
redundancy, are another type of VDEV. And ZFS includes three different
varieties of sophisticated RAID, called RAID-Z.
These VDEVs are arranged into the storage pools discussed in Chapter
3. Actual data goes on top of the pools, as Chapter 4 covers. But the
arrangement of your virtual devices dictates how well the pool performs
and how well it resists physical damage. Almost all of ZFS’ redundancy
comes from the virtual devices.
A storage pool consists of one or more VDEVs where the pool data is
spread across those VDEVs with no redundancy. (You can add some
redundancy with the copies property, as discussed in Chapter 4, but that
provides no protection against total disk failure.) The ZFS pool treats
VDEVs as single units that provide storage space. Storage pools cannot
survive the loss of a VDEV, so it’s important that you either use VDEVs
with redundancy or decide in advance that it’s okay to lose the data in this
pool.
Using multiple VDEVs in a pool creates systems similar to advanced
RAID arrays. A RAID-Z2 array resembles RAID-6, but a ZFS pool with
two RAID-Z2 VDEVs resembles RAID-60. Mirrored VDEVs look like
RAID-1, but groups of them resemble RAID-10. In both of these cases,
ZFS stripes the data across each VDEV with no redundancy. The
individual VDEVs provide the redundancy.
VDEV Redundancy
A VDEV that contains more than one disk can use a number of different
redundancy schemes to provide fault tolerance. Nothing can make a single
disk sitting all by itself redundant. ZFS supports using mirrored disks and
several parity-based arrays.
ZFS uses redundancy to self-heal. A VDEV without redundancy doesn’t
support self-healing. You can work around this at the dataset layer (with
the copies property), but a redundant VDEV supports self-healing
automatically.
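Asking for extra copies is a one-line dataset change, shown here on a hypothetical dataset. Remember that it guards against bad blocks, not against losing the whole disk.
# zfs set copies=2 zroot/usr/home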
Stripe (1 Provider)
A VDEV composed of a single disk is called a stripe, and has no
redundancy. As you might expect, losing the single provider means that all
data on the disk is gone. A stripe pool contains only single-disk VDEVs.
A ZFS pool stripes data across all the VDEVs in the pool and relies on
the VDEV to provide redundancy. If one stripe device fails, the entire pool
fails. All data stored on the pool is gone. This is fine for scratch partitions,
but if you care about your data, use a type of VDEV that offers fault
tolerance.
Mirrors (2+ Providers)
A mirror VDEV stores a complete copy of all data on every disk. You can
lose all but one of the drives in the VDEV and still access your data. You
can use any number of disks in a mirror.
Mirrors provide very good random and sequential read speeds because
data can be read from all of the disks at once. Write performance suffers
because all data must be written to all of the disks, and the operation is not
complete until the slowest disk has finished.
RAID-Z1 (3+ Providers)
ZFS includes three modern RAID-style redundant VDEVs, called RAID-Z.
RAID-Z resembles RAID-5, but includes checksumming to ensure file
integrity. Between checksums and ZFS’ copy-on-write features (Chapter
7), RAID-Z ensures that incomplete writes do not result in an inconsistent
filesystem.
RAID-Z spreads data and parity information across all of the disks. If a
provider in the RAID-Z dies or starts giving corrupt data, RAID-Z uses the
parity information to recalculate the missing data. You might hear that
RAID-Z uses a dedicated provider to store parity information, but there’s no single parity provider—the parity role rotates through the providers, spreading the parity across the VDEV along with the data.
A RAID-Z1 VDEV can withstand the failure of any single storage
provider. If a second provider fails before the first failed drive is replaced,
all data is lost. Rebuilding a disk array from parity data can take a long
time. If you’re using large disks—say, over 2 TB—there’s a non-trivial
chance of a second drive failing as you repair the first drive. For larger
disks, you should probably look at RAID-Z2.
RAID-Z2 (4+ Providers)
RAID-Z2 resembles RAID-Z1, but has two parity disks per VDEV. Like RAID-6, RAID-Z2 can keep operating even with two failed providers. It is slightly slower than RAID-Z1, but allows you to be
somewhat lazy in replacing your drives.
RAID-Z3 (5+ Providers)
The most paranoid form of RAID-Z, RAID-Z3 uses three parity disks per
VDEV. Yes, you can have three failed disks in your five-disk array. It is
slightly slower than RAID-Z2. Failure of a fourth disk results in total data
loss.
RAID-Z Disk Configurations
One important thing to remember when using any version of RAID-Z is
that the number of providers in a RAID-Z is completely fixed. You cannot
add drives to a RAID-Z VDEV to expand them. You can expand the pool
by adding VDEVs, but you cannot expand a VDEV by adding disks. There
are no plans to add this feature.
Suppose you have a host that can accept 20 hard drives. You install 12
drives and use them as a single RAID-Z2, thinking that you will add more
drives to your pool later as you need them. Those new drives will have to
go in as a separate RAID-Z2 VDEV.
What’s more, your VDEVs will be unbalanced. Your pool will have one 12-drive VDEV and a second 8-drive VDEV. One will be slower
than the other. ZFS will let you force it to pool these devices together, but
it’s a really bad idea to do so.
Plan ahead. Look at your physical gear, the number of drives you have
to start with, and how you’ll expand that storage. Our example server
would be fine with one pool containing a single RAID-Z2 VDEV, and a
completely separate pool containing the other eight disks in whatever
arrangement you want. Don’t cut your own throat before you even start!
The RAID-Z Rule of 2s
One commonly discussed configuration is to have a number of data disks
equal to a multiple of two, plus the parity disks needed for a given RAID-Z
level. That is, this rule says that a RAID-Z1 should use 2n+1 disks, or
three, five, seven, nine, and so on. A RAID-Z2 should use 2n+2 disks (four,
six, eight, and so on), while a RAID-Z3 should use 2n+3 (five, seven, nine,
and so on).
This rule works—if and only if your data is composed of small blocks
with a size that is a power of 2. Other factors make a much bigger
difference, though. Compression is generally considered far more effective than this sort of tuning, and compressing your data reduces the size of the blocks, eliminating the rule’s benefit.
Repairing VDEVs
When a provider that belongs to a redundant VDEV fails, the VDEV it is a
member of becomes “degraded.” A degraded VDEV still has all of its data,
but performance might be reduced. Chapter 5 covers replacing failed
providers.
After the provider is replaced, the system must store data on the new
provider. Mirrors make this easy: read the data from the remaining disk(s)
and write it to the replacement. For RAID-Z, the data must be recalculated
from parity.
The way that ZFS combines RAID and the filesystem means that ZFS
knows which blocks contain data, and which blocks are free. Instead of
having to write out every byte on to the new drive, ZFS needs to write only
the blocks that are actually in use. A traditional RAID controller has no
understanding or awareness of the filesystem layer, so it has no idea what is
in use and what is free space. When a RAID controller replaces a disk, it
must copy every byte of the new disk. This means a damaged ZFS RAID-Z
heals much more quickly, reducing the chance of a concurrent failure that
could cause data loss. We discuss ZFS recovery in Chapter 5.
RAID-Z versus traditional RAID
RAID-Z has a number of advantages compared to traditional RAID, but the
biggest ones come from the fact that ZFS is the volume manager and the
filesystem in addition to the disk redundancy layer.
Back in the day, filesystems could only work on one disk. If you had
two disks, you needed two separate filesystems. Traditional RAID let you
combine multiple disks into one virtual disk, permitting the creation of
massive disks as large as 100 MB, or even bigger! Then the operating
system puts its own filesystem on top of that, without any understanding of
how the blocks will be laid out on the physical disks. At the same time,
RAID could provide fault tolerance. Given the limitations of hardware and
software at the time, RAID seemed a pretty good bet.
By combining the filesystem and the volume manager, ZFS can see
exactly where all data lies and how the storage layer and the data interact.
This allows ZFS to make a number of important decisions, such as ensuring
that extra copies of important data such as ditto blocks (Chapter 3) are
stored on separate disks. It does no good to have two or three copies of
your critical data all on one underlying storage provider that can be wiped
out by a single hardware failure. ZFS goes so far as to put the ditto blocks
on adjacent disks, because it is statistically less likely that if two disks fail
concurrently, they will be neighbors.
Traditional RAID can suffer from a shortcoming known as the “write
hole,” where two-step operations get cut short halfway through. RAID 5
and 6 devices chunk up data to be written to all of the data disks. Once this
operation finishes, a parity block is calculated and stored on the parity disk.
If the system crashes or the power is cut after the data is written but before
the parity is written, the disk ends up in an indeterminate state. When the
system comes back up, the data does not match the parity. The same thing
can happen with mirrored drives if one drive finishes updating and the
other does not.
Write hole problems are not noticed until you replace a failed disk. The
incorrect parity or incorrect mirror results in the RAID device returning
garbage data to the filesystem. Traditional filesystems return this garbage
data as the contents of your file.
ZFS solves these problems with copy-on-write and checksums. Copy-
on-write (Chapter 7) means data is never overwritten in place. Each update
is transactional, and either completes fully or is not performed, returning
the system to the state it was in before the update. ZFS also has checksums,
so it can detect when a drive returns invalid data. When ZFS detects invalid
data it replaces that data with the correct data from another source, such as
additional copies of the data, mirrored drives, or RAID-Z parity.
Combined, these create ZFS’ self-healing properties.
Special VDEVs
Pools can use special-purpose VDEVs to improve the performance of the
pool. These special VDEV types are not used to persistently store data, but
instead temporarily hold additional copies of data on faster devices.
Separate Intent Log (SLOG, ZIL)
ZFS maintains a ZFS Intent Log (ZIL) as part of the pool. Similar to the
journal in some other filesystems, this is where it writes in-progress
operations, so they can be completed or rolled back in the event of a system
crash or power failure. The ZIL is subject to the disk’s normal operating
conditions. The pool might have a sudden spike in use or latency related to
load, resulting in slower performance.
One way to boost performance is to separate the ZIL from the normal
pool operations. You can use a dedicated device as a Separate Intent Log,
or SLOG, rather than using a regular part of the pool. The dedicated device
is usually a small but very fast device, such as a very high-endurance SSD.
Rather than copying data from the SLOG to the pool’s main storage in
the order it’s received, ZFS can batch the data in sensible groups and write
it more efficiently.
Certain software insists on receiving confirmation that data it writes to
disk is actually on the disk before it proceeds. Databases often do this to
avoid corruption in the event of a system crash or power outage. Certain
NFS operations do the same. By writing these requests to the faster log
device and reporting “all done,” ZFS accelerates these operations. The
database completes the transaction and moves on. You get write
performance almost at an SSD level, while using inexpensive disk as the
storage media.
You can mirror your ZIL to prevent data loss.
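Adding a mirrored SLOG to an existing pool is a single command, shown here with a hypothetical pool name and two SSD partitions:
# zpool add mypool log mirror gpt/slog0 gpt/slog1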
Cache (L2ARC)
When a file is read from disk, the system keeps it in memory until the
memory is needed for another purpose. This is old technology, used even
back in the primordial BSD days. Look at top(1) on a UFS-based BSD
system and you’ll see a chunk of memory labeled Buf. That’s the buffer
cache.
The traditional buffer cache was designed decades ago, however. ZFS
has an Adaptive Replacement Cache, or ARC, designed for modern
hardware, that gives it more speed. The ARC retains the most recently and
frequently accessed files.
Very few modern systems have enough RAM to cache as much as they
want, however. Just as ZFS can use a SLOG to accelerate writes, it can use
a very fast disk to accelerate reads. This is called a Level 2 ARC, or
L2ARC.
When an object is used frequently enough to benefit from caching, but
not frequently enough to rate being stored in RAM, ZFS can store it on a
cache device. The L2ARC is typically a very fast and high-endurance SSD
or NVMe device. Now, when that block of data is requested, it can be read
from the faster SSD rather than the slower disks that make up the rest of the
pool. ZFS knows which data has been changed on the back-end disk, so it
can ensure that the read cache is synchronized with the data on the storage
pool.
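Attaching a cache device works much the same way, again with hypothetical names:
# zpool add mypool cache gpt/cache0
Cache devices need no redundancy; if the L2ARC device dies, ZFS simply goes back to reading from the main pool.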
How VDEVs Affect Performance
Each different type of VDEV performs differently. Benchmarking and
dissecting disk performance is a complex topic that would merit a great big
textbook, if anyone would be bothered to read it. Any specific advice we
were to give here would quickly become obsolete, so let’s just discuss
some general terms.
One common measurement is Input/Output Operations Per Second, or IOPS: the number of distinct operations the drive can perform each second. Spinning
drive IOPS are usually physically limited by how quickly the read/write
head can move from place to place over the platter. Solid state disks have
such excellent performance because they don’t need to physically move
anything.
The number of non-parity spindles constrains streaming read and write
performance of an undamaged pool. “Streaming” performance boils down
to the number of megabytes per second (MB/s) the drive can read or write.
When a drive reads or writes data sequentially, the heads do not have to
seek back and forth to different locations. It is under these conditions that a
drive will achieve its best possible streaming performance, giving the
highest throughput. Spindle count affects both random and streaming
performance. An array of 12 one-terabyte (12 x 1 TB) drives usually
outperforms an array of six two-terabyte (6 x 2 TB) drives because the
greater spindle and head counts increase both IOPS and streaming
performance. Having more heads means that ZFS can be reading from, or
writing to, more different locations on the disks at once, resulting in greater
IOPS performance. More spindles mean more disks working as fast as they
can to read and write your data. The greater number of drives requires a
larger shelf or chassis, more power, and more controllers, however.
Other common measurements include read bandwidth, write bandwidth,
space efficiency, and streaming performance.
Generally speaking, mirrors can provide better IOPS and read
bandwidth, but RAID-Z can provide better write bandwidth and much
better space efficiency.
A pool with multiple VDEVs stripes its data across all the VDEVs. This
increases performance but might cost space, as each individual VDEV has
its own redundant disks. A pool with multiple VDEVs probably has
increased reliability and fault tolerance. While ZFS’ redundancy is all at
the VDEV level, a pool with multiple redundant VDEVs can probably
withstand more disk failures. The more VDEVs you have in a pool, the
better the pool performs.
Let’s go through some common VDEV configurations and see how the
various possible arrangements affect performance and capacity. Assume
we’re using a set of modest commodity 1 TB spinning disks. Each disk is
capable of 250 IOPS and streaming read/writes at 100 MB/s.
One Disk
With only one disk, there is only one possible configuration, a single ZFS
stripe VDEV. This is the most basic configuration, and provides no fault
tolerance. If that one disk dies, all of your data is gone.
Table 1: Single Disk Virtual Device Configurations
Config      Disks  Read MB/s  Read IOPS  Write MB/s  Write IOPS  Usable Space  Fault Tolerance
1 x Stripe  1      100        250        100         250         1 TB (100%)   none
Two Disks
Table 2: Two-Disk Virtual Device Configurations
Config             Disks  Read MB/s  Read IOPS  Write MB/s  Write IOPS  Usable Space  Fault Tolerance
2 x Stripe         2      200        500        200         500         2 TB (100%)   none
1 x 2-disk Mirror  2      200        500        100         250         1 TB (50%)    1
As the table shows, our mirror pool gets half the write performance of
the striped pool and has half the space.
Three Disks
Three disks means more options, including a deeper mirror and RAID-Z.
You could also use a pool of three stripe disks, but the odds of failure are
much higher.
A deeper mirror has more disks, providing more fault tolerance and
improved read performance. More spindles and heads mean that the VDEV
can read data from the least busy of the three drives, serving random reads
more quickly. Write performance in a mirror is still limited to the slowest
of the drives in the mirror.
RAID-Z1 offers better space efficiency, as the fault tolerance requires
only one of the disks in the VDEV. Data is spread across all of the drives,
so they must work together to perform reads and writes. Spreading the data
across all the drives improves streaming write performance. Unlike a
mirror, in RAID-Z all drives can write their share of data simultaneously,
instead of each drive writing identical data.
Table 3: Three-Disk Virtual Device Configurations
Config              Disks  Read MB/s  Read IOPS  Write MB/s  Write IOPS  Usable Space  Fault Tolerance
1 x 3-disk Mirror   3      300        750        100         250         1 TB (33%)    2
1 x 3-disk RAID-Z1  3      200        250        200         250         2 TB (66%)    1
Four or Five Disks
Table 4: Four- and Five-Disk Virtual Device Configurations
Config              Disks  Read MB/s  Read IOPS  Write MB/s  Write IOPS  Usable Space  Fault Tolerance
2 x 2-disk Mirror   4      400        1000       200         500         2 TB (50%)    1/VDEV
1 x 4-disk RAID-Z1  4      300        250        300         250         3 TB (75%)    1
1 x 4-disk RAID-Z2  4      200        250        200         250         2 TB (50%)    2
1 x 5-disk RAID-Z1  5      400        250        400         250         4 TB (80%)    1
1 x 5-disk RAID-Z2  5      300        250        300         250         3 TB (60%)    2
1 x 5-disk RAID-Z3  5      200        250        200         250         2 TB (40%)    3
Note how the streaming (MB/s) read and write performance of RAID-
Z1 compares with RAID-Z2, and how the performance of RAID-Z3
compares to both. Adding a parity disk means sacrificing that disk’s
throughput.
The fault tolerance of multiple mirror VDEVs is slightly tricky.
Remember, redundancy is per-VDEV, not per pool. Each mirror VDEV
still provides n - 1 fault tolerance. As long as one drive in each mirror
VDEV still works, all data is accessible. With two two-disk mirror VDEVs
in your pool, you can lose one disk from each VDEV and keep running. If
you lose two disks from the same VDEV, however, the pool dies and all
data is lost.
Six to Twelve Disks
With large numbers of disks, the decision shifts to balancing fault
tolerance, space efficiency, and performance.
Six disks could become three two-disk mirror VDEVs, giving you a fair
amount of space and good write performance. You could opt for a pair of
three-disk mirror VDEVs, giving you less space, but allowing two disks out
of each set of three to fail without risking data loss. Or they could become a
RAID-Z VDEV.
Get many more than six disks and you can have multiple RAID-Z
VDEVs in a pool. A dozen disks can be operated together as a single
VDEV giving the most available space, or can be split into two separate
VDEVs, providing less usable space but better performance and more fault
tolerance.
Table 5: Six- to Twelve-Disk Virtual Device Configurations
Config               Disks  Read MB/s  Read IOPS  Write MB/s  Write IOPS  Usable Space  Fault Tolerance
3 x 2-disk Mirror    6      600        1500       300         750         3 TB (50%)    1/VDEV
2 x 3-disk Mirror    6      600        1500       200         500         2 TB (33%)    2/VDEV
1 x 6-disk RAID-Z1   6      500        250        500         250         5 TB (83%)    1
1 x 6-disk RAID-Z2   6      400        250        400         250         4 TB (66%)    2
1 x 6-disk RAID-Z3   6      300        250        300         250         3 TB (50%)    3
6 x 2-disk Mirror    12     1200       3000       600         1500        6 TB (50%)    1/VDEV
4 x 3-disk Mirror    12     1200       3000       400         1000        4 TB (33%)    2/VDEV
1 x 12-disk RAID-Z1  12     1100       250        1100        250         11 TB (92%)   1
2 x 6-disk RAID-Z1   12     1000       500        1000        500         10 TB (83%)   1/VDEV
3 x 4-disk RAID-Z1   12     900        750        900         750         9 TB (75%)    1/VDEV
1 x 12-disk RAID-Z2  12     1000       250        1000        250         10 TB (83%)   2
2 x 6-disk RAID-Z2   12     800        500        800         500         8 TB (66%)    2/VDEV
1 x 12-disk RAID-Z3  12     900        250        900         250         9 TB (75%)    3
2 x 6-disk RAID-Z3   12     600        500        600         500         6 TB (50%)    3/VDEV
Many Disks
Table 6: Thirty-Six-Disk Virtual Device Configurations
Config               Disks  Read MB/s  Read IOPS  Write MB/s  Write IOPS  Usable Space  Fault Tolerance
18 x 2-disk Mirror   36     3600       9000       1800        4500        18 TB (50%)   1/VDEV
12 x 3-disk Mirror   36     3600       9000       1200        3000        12 TB (33%)   2/VDEV
1 x 36-disk RAID-Z2  36     3400       250        3400        250         34 TB (94%)   2
2 x 18-disk RAID-Z2  36     3200       500        3200        500         32 TB (89%)   2/VDEV
4 x 9-disk RAID-Z2   36     2800       1000       2800        1000        28 TB (78%)   2/VDEV
6 x 6-disk RAID-Z2   36     2400       1500       2400        1500        24 TB (66%)   2/VDEV
By using more VDEVs, you can create screaming fast pools. A pool of
18 two-disk mirror VDEVs can read data more quickly than most anything
else—and it can lose 18 drives before failing! Yes, they have to be the right
18 drives, but if you have two disk shelves with different power supplies,
that’s entirely possible. On the other hand, if the wrong two disks in that
pool fail, your entire pool dies.
Adding parity or mirrors to each VDEV increases reliability. A greater
number of VDEVs increases performance. Your job is to juggle these two
characteristics to support your environment.
Each VDEV is limited to the random read/write performance of the
slowest disk, so if you have too many disks in one VDEV, you are
surrendering performance for only a small gain in space efficiency. While
you can add L2ARC and SLOG devices to improve performance, it’s best
to avoid these problems altogether.
So if more VDEVs are always better, why is the 6 x 6 disk RAID-Z2
pool so much slower at reading and writing compared to the 1 x 36 disk
RAID-Z2 pool? The answer lies in the fault tolerance column. When you
have more RAID-Z2 VDEVs, you have more redundancy, and you can
survive more failures. When a disk is providing fault tolerance, it is storing
an extra copy of your data, so it can replace a copy that is lost when a disk
fails. The system recalculates and stores parity data every time the data
changes. Parity data isn’t used when reading files unless the original copy
is missing. The disks used for parity no longer contribute to streaming
performance. You can restore that performance by adding more disks. A 6
x 8 disk RAID-Z2 pool would have the equivalent of 36 data disks and 12
parity disks, and be able to outperform the 1 x 36 disk RAID-Z2 pool.
Let’s take what you know about VDEVs, and create some actual pools
with them.
1 If you’re storing your data on clay tablets, you may use bsdlabel(8) partitions.
2 FreeBSD’s flexible storage system gives you the power to do stupid things. Don’t.
Chapter 3: Pools
ZFS pools, or zpools, form the middle of the ZFS stack, connecting the
lower-level virtual devices to the user-visible filesystem. Pools are where
many filesystem-level tasks happen, such as allocating blocks of storage.
At the ZFS pool level you can increase the amount of space available to
your ZFS dataset, or add special virtual devices to improve reading or
writing performance.
ZFS Blocks
Traditional filesystems such as UFS and extfs place data on the disk in
fixed-size blocks. The filesystem has special blocks, called inodes, that
index which blocks belong to which files. Even non-Unix filesystems like
NTFS and FAT use similar structures. It’s a standard across the industry.
ZFS does not pre-configure special index blocks. It only uses storage
blocks, also known as stripes. Each block contains index information to
link the block to other blocks on the disk in a tree. ZFS computes hashes of all the information in the block and stores those hashes in the block and in its parent block. Each block is a complete unit in and of itself. A file might
be partially missing, but what exists is coherent.
Not having dedicated special index blocks sounds great, but surely ZFS
needs to start somewhere! Every data tree needs a root. ZFS uses a special
block called an uberblock to store a pointer to the filesystem root. ZFS
never changes data on the disk—rather, when a block changes, it writes a
whole new copy of the block with the modified data. (We discuss this
copy-on-write behavior in depth in Chapter 6.) A data pool reserves 128
blocks for uberblocks, used in sequence as the underlying pool changes.
When the last uberblock gets used, ZFS loops back to the beginning.
The uberblocks are not the only critical blocks. ZFS copies blocks
containing vital information like filesystem metadata and pool data into
multiple ditto blocks. If a main block is damaged, ZFS checks the ditto
block for a backup copy. Ditto blocks are stored as far as possible from
each other, either on separate disks or on separate parts of a single disk.
(ZFS has no special ability to see the layout of the disk hardware, but it
makes a valiant guess.)
ZFS commits changes to the storage media in transaction groups, or txg.
Transaction groups contain several batched changes, and have an
incrementing 64-bit number. Each transaction group uses the next
uberblock in line. ZFS identifies the most current uberblock out of the
group of 128 by looking for the uberblock with the highest transaction
number.
ZFS does use some blocks for indexing, but these znodes and dnodes
can use any storage block in the pool. They aren’t like UFS2 or extfs index
nodes, assigned when creating the filesystem.
Stripes, RAID, and Pools
You’ve certainly heard the word stripe in connection with storage,
probably many times. A ZFS pool “stripes” data across the virtual devices.
A traditional RAID “stripes” data across the physical devices. What is a
stripe, and how does it play into a pool?
A stripe is a chunk of data that’s written to a single device. Most
traditional RAID uses a 128 KB stripe size. When you’re writing a file to a
traditional RAID device, the RAID software writes to each drive in 128 KB
chunks, usually in parallel. Similarly, reads from a traditional RAID array
take place in increments of the stripe size. While you can customize the
stripe size to fit a server’s workload, the hardware’s capacity and the
software’s limitations greatly restrict stripe size.
Stripes do not provide any redundancy. Traditional RAID gets its
redundancy from parity and/or mirroring. ZFS pools get any redundancy
from the underlying VDEVs.
ZFS puts stripes on rocket-driven roller skates. A ZFS dataset uses a
default stripe size of 128 KB, but ZFS is smart enough to dynamically
change that stripe size to fit the equipment and the workload. If a 32 KB
stripe size makes sense for a particular chunk of data, but 64 KB makes
sense for another piece of data, ZFS uses the appropriate size for each one.
The ZFS developers have completed support for stripe sizes up to 1 MB.
This feature is already available in FreeBSD-CURRENT, and is expected
to be included in FreeBSD 10.2 and later.
A ZFS pool has much more flexibility than a traditional RAID.
Traditional RAID has a fixed and inflexible data layout (although some
hardware vendors have their own proprietary RAID systems with more
flexibility). The RAID software writes to each disk in a deterministic order.
ZFS has more flexibility. If you have a five-disk traditional RAID array,
that array will always have five disks. You cannot change the array by
adding disks. While you might be able to exchange the disks for larger
disks, doing so won’t change the array’s size. Creating a RAID device
petrifies the array’s basic characteristics.
ZFS pools not only tolerate changes, but they’re designed to easily
accept additions as well. If you have a ZFS pool with five VDEVs and you
want to add a sixth, that’s fine. ZFS accepts that VDEV and starts striping
data on that device without blinking. You cannot add storage to RAID-Z
VDEVs, only VDEVs to pools. The number of providers in a RAID-Z
VDEV is fixed at creation time.
With ZFS, though, that virtual device can be any type of VDEV ZFS
supports. Take two VDEVs that are mirror pairs. Put them in a single
zpool. ZFS stripes data across them. In traditional RAID, a stripe on top of
mirrors would be called RAID-10. For most use cases, RAID-10 is the
highest-performance RAID you can have. Where traditional RAID-10 has a
fixed size, however, you can add additional VDEVs to a pool. Expanding
your RAID-10 means backing up your data, adding disks to the RAID
array, and restoring the data. Expanding your zpool means adding more
VDEVs to the pool. RAID-10 also allows a depth of up to two disks, where ZFS allows a depth of up to 2^64.
Remember, though, that pools do not provide any redundancy. All ZFS
redundancy comes from the underlying VDEVs.
Viewing Pools
To see all of the pools on your system, run zpool list.
# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
db     2.72T  1.16G  2.72T         -     0%     0%  1.00x  ONLINE  -
zroot   920G  17.3G   903G         -     2%     1%  1.00x  ONLINE  -
The first column gives the pool name. This system has two pools, db
and zroot.
The next three columns give size and usage information on each pool.
You’ll get the size, the amount of space used, and the amount of free space.
The EXPANDSZ column shows if the underlying storage providers
have any free space. You might be able to expand the amount of space in
this pool, as discussed in Chapter 5. This space includes blocks that will go
to parity information, so expanding the pool won’t give you this much
usable space.
Under FRAG you’ll see the amount of fragmentation in this pool.
Fragmentation degrades filesystem performance.
The CAP column shows what percentage of the available space is used.
The DEDUP entry shows the amount of deduplication that’s happened
on the filesystem. Chapter 6 covers deduplication.
The pool’s HEALTH column reflects the status of the underlying
VDEVs. If a storage provider fails, your first hint will be any status other
than ONLINE. Chapter 5 discusses pool health.
Finally, the ALTROOT shows where this pool is mounted, or its
“alternate root.” Chapter 4 covers alternate roots.
If you want to know the information for a specific pool or pools, list the
pool names after zpool list:
# zpool list prod test
This shows only the output of the storage pools prod and test.
The -p flag prints numbers in bytes rather than the more human-friendly
format, and -H eliminates the column headers. These options are useful for
automation and management scripts.
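For example, a script-friendly listing of a single pool might look like this:
# zpool list -p -H zroot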
For a more detailed view of a system’s pools, including the underlying
VDEV layout, use zpool status. We’ll see lots of examples of zpool
status when we create pools.
# zpool status -x
all pools are healthy
# sysctl vfs.zfs.min_auto_ashift=12
Use the command line during installation, but also set it permanently in
/etc/sysctl.conf so you don’t forget when creating new pools.
This book’s examples assume that you’re using FreeBSD 10.1 or newer.
For older FreeBSD versions, you’ll need to set the ashift each time rather
than setting the sysctl.
Older FreeBSD Ashift
FreeBSD versions older than 10.1 lack the ashift sysctl found in newer
FreeBSD versions, so you have to rely on ZFS’ internal sector-size-
detection code. This code reads the sector size from the underlying storage
medium—namely, the storage provider.
This case highlights the critical difference between a provider and a
disk. FreeBSD lets you create a pass-through device with the GEOM
module gnop(8). The gnop module lets you insert arbitrary data between
your storage devices—in this case, enforcing a sector size. You create a
gnop device that says, “Pass everything through transparently, but insist on
a 4096-byte sector size.” Use that device to create your zpool. Here, we add
a gnop device to the partition labeled /dev/gpt/zfs0.
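The sequence looks roughly like this, with placeholder pool and provider names: create the transparent 4K device, build the pool on it, then export the pool, destroy the gnop device, and re-import so ZFS uses the real provider directly.
# gnop create -S 4096 gpt/zfs0
# zpool create mypool gpt/zfs0.nop
# zpool export mypool
# gnop destroy gpt/zfs0.nop
# zpool import mypool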
We manage the ZFS pools with the GPT labels, so the examples
reference gpt/zfs0 through gpt/zfs5. In production, use meaningful labels
that map disks to physical locations.
Striped Pools
Some storage pools don’t need redundancy, but do need lots of space.
Scratch partitions for engineering and physics computations are common
use cases for this kind of storage. Use zpool create, the pool name, and
list the devices in the pool. Remember to set the ashift before creating the
pool.
Here we create a pool of five storage providers.
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create compost gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
If the command succeeds, you get no output back. See if the pool exists
with zpool status.
# zpool status
pool: compost
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
compost ONLINE 0 0 0
gpt/zfs0 ONLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
gpt/zfs2 ONLINE 0 0 0
gpt/zfs3 ONLINE 0 0 0
gpt/zfs4 ONLINE 0 0 0
All five providers appear. Each provider is its own VDEV. This is a big
pool for a system this size.
This pool stripes data across all the member VDEVs, but the VDEVs
have no redundancy. Most real-world applications require redundancy. The
simplest sort of redundancy is the mirror.
Mirrored Pools
Mirrored devices copy all data to multiple storage providers. If any one
provider on the mirror fails, the pool still has another copy of the data.
Traditional mirrors have two disks, although more is certainly possible.
Use the same zpool create command and the pool name. Before listing
the storage devices, use the mirror keyword. Set the system ashift before
creating the pool.
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create reflect mirror gpt/zfs0 gpt/zfs1
# zpool status
pool: reflect
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
reflect ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/zfs0 ONLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
You can certainly have a mirror with many disks if this fits your needs.
Too many copies are better than not enough.
# zpool create reflect mirror gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3
This might be an example of going too far, however (although we do
discuss splitting a mirror into multiple pools in FreeBSD Mastery:
Advanced ZFS.)
RAID-Z Pools
The redundancy you get from mirrors is fast and reliable, but not terribly
complicated or exciting. RAID-Z offers greater flexibility at a complexity
cost, which makes it more exciting.3 Create a RAID-Z pool much as you
would any other zpool: run zpool create and give the pool name, the type,
and the storage devices. Here we create a RAID-Z (or RAID-Z1) pool.
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create bucket raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2
The new pool’s status shows a new VDEV, called raidz1-0, with three
providers.
If any one disk in this pool fails, the data remains intact. Other RAID-Z
levels have even more redundancy. Here we pull six providers into a
RAID-Z3. The only difference between creating the RAID-Z3 and the
RAID-Z1 is the use of raidz3 and the additional devices needed.
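The command would look something like this; the pool name bucket is taken from the status output below:
# zpool create bucket raidz3 gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4 gpt/zfs5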
As you might guess by now, the pool’s status shows a new device called
raidz3-0.
# zpool status
pool: bucket
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
bucket ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
gpt/zfs0 ONLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
gpt/zfs2 ONLINE 0 0 0
gpt/zfs3 ONLINE 0 0 0
gpt/zfs4 ONLINE 0 0 0
gpt/zfs5 ONLINE 0 0 0
All of these pools have a single VDEV. What if you want multiple
VDEVs, though?
Multi-VDEV Pools
You can create a pool with multiple VDEVs. The keywords mirror, raidz,
raidz2, and raidz3 all tell zpool(8) to create a new VDEV. Any storage
providers listed after one of those keywords goes into creating a new
instance of that VDEV. When one of the keywords appears again,
zpool(8) starts with a new VDEV.
The opening of this chapter covered striping across multiple mirrors,
simulating a traditional RAID-10 setup. Here we do exactly that.
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create barrel mirror gpt/zfs0 gpt/zfs1 mirror gpt/zfs2 gpt/zfs3
The first three words, zpool create barrel, tell zpool(8) to instantiate
a new pool, named barrel. The mirror keyword says “create a mirror.” We
then have two storage providers, gpt/zfs0 and gpt/zfs1. These storage
providers go into the first mirror. The word mirror appears again, telling
zpool(8) that the previous VDEV is complete and we’re starting on a new
VDEV. The second VDEV also has two storage providers, gpt/zfs2 and
gpt/zfs3. This pool’s status looks different than anything we’ve seen
before.
The pool has two VDEVs, mirror-0 and mirror-1. Each VDEV includes
two storage devices. We know that ZFS stripes data across all the VDEVs.
Stripes over mirrors are RAID-10.
You can also arrange multi-VDEV pools in ways that have no common
RAID equivalent. While software RAID systems like some of FreeBSD’s
GEOM classes would let you build similar RAIDs, you won’t find them on
a hardware RAID card. Here we create a pool that stripes data across two
RAID-Z1 VDEVs.
# zpool create vat raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5
The first RAID-Z1 VDEV includes three storage providers, gpt/zfs0, gpt/zfs1, and gpt/zfs2; the second includes gpt/zfs3, gpt/zfs4, and gpt/zfs5.
On systems that need high availability, you can mirror these write
caches. Mirroring the read cache doesn’t make much sense—if you lose the
read cache, ZFS falls back to reading from the actual pool. Losing the ZIL
write log can cause data loss, however, so mirroring it makes sense. Here
we create a stripe of two mirrors using devices gpt/zfs0 through gpt/
zfs3, with mirrored log devices gpt/zlog0 and gpt/zlog1.
# zpool create db mirror gpt/zfs0 gpt/zfs1 mirror gpt/zfs2 gpt/zfs3 log mirror gpt/zlog0 gpt/zlog1
You can add intent log and read cache devices to an existing pool, or
remove them. If you’re not sure you need the performance boost of these
devices, try running the pool without them. Make sure that your hardware
has space to add SSD storage devices later, however!
Mismatched VDEVs
Using different VDEV types within a pool is not advisable, and zpool(8) balks if you try.
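Trying it produces a complaint along these lines (the pool name and layout here are examples):
# zpool create daft mirror gpt/zfs0 gpt/zfs1 raidz1 gpt/zfs2 gpt/zfs3 gpt/zfs4
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: both mirror and raidz vdevs are present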
The zpool(8) command points out the mistake, and then tells you how
to insist. We normally take these kinds of errors as a way of saying the
sysadmin needs more caffeine, but maybe you really intended it. Running
zpool create -f with the specified VDEV types and storage providers tells
ZFS that yes, you fully intended to create a malformed pool. Hey, it’s your
system; you’re in charge.
If ZFS doesn’t want you to do something, you probably shouldn’t.
When you use -f, you're creating something that ZFS isn't designed to
handle. You can easily create a pool that won’t work well and cannot be
repaired.
Reusing Providers
We sometimes create and destroy pools more than once to get them right.
We might pull disks from one machine and mount them in another.
Sometimes we encounter a disk that we’ve used before.
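In that case, zpool create fails with an error along these lines:
# zpool create db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
invalid vdev specification
use '-f' to override the following errors:
/dev/gpt/zfs1 is part of exported pool 'db'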
We used this disk in another pool, which we later exported (see Chapter
5). The problem disk was used in that pool, and the ZFS label remained on
the disk. While we erased and recreated the partition table, the new
partition table happens to be precisely identical to the previous one. ZFS
easily finds the old metadata in this case.
If you’re absolutely sure this provider doesn’t have anything important
on it, follow the instructions and force creation of the new pool with -f.
# zpool create -f db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
The ZFS programs can be very picky about where your command-line
flags go, so be sure the -f immediately follows create.
Pool Integrity
One common complaint about ZFS is that it has no filesystem checker,
such as fsck(8). An offline file checker wouldn’t improve ZFS because the
online pool integrity checker verifies everything that fsck(8) checks for
and more. The online checker is also much more effective than a traditional
filesystem would ever let fsck(8) be. Let’s talk about how ZFS ensures file
integrity, and then how pool scrubbing helps maintain integrity.
ZFS Integrity
Storage devices screw up. When you have trillions of sectors on any sort of
disk, the odds of a stray cosmic ray striking one hard enough to make it
stagger around drunkenly go way up—as well as the odds of a write error,
or a power failure, or a short in a faulty cable, or any number of other
problems. No filesystem can prevent errors in the underlying hardware.
ZFS uses hashes almost everywhere. A hash is a mathematical
algorithm that takes a chunk of data and computes a fixed-length string
from it. The interesting thing about a hash is that minor changes in the
original data dramatically change the hash of the data. Each block of
storage includes the hash of its parent block, while each parent block
includes the hash of all its children.
While ZFS cannot prevent storage provider errors, it uses these hashes
to detect them. Whenever the system accesses data, it verifies the
checksums. ZFS uses the pool’s redundancy to repair any errors before
giving the corrected file to the operating system. This is called self-healing.
If the underlying VDEVs have redundancy, ZFS either reconstructs the
damaged block from RAID-Z or grabs the intact copy from the mirror. If
both sides of a mirror have errors, ZFS can recover the files so long as the
same data is not bad on both disks. If the VDEV has no redundancy, but a
dataset has extra copies of the data (see Chapter 4), ZFS uses those extra
copies instead.
If the underlying VDEV has no redundancy, and the dataset does not
keep extra copies, the pool notes that the file is damaged and returns an
error, instead of returning incorrect data. You can restore that file from
backup, or throw it away.
While ZFS performs file integrity checks, it also verifies the connections
between storage blocks. This is the task performed by fsck(8) in
traditional filesystems. It’s a small part of data verification, and ZFS
performs this task continually as part of its normal operation. ZFS has an
additional advantage over fsck(8) in that it checks only blocks that actually
exist, rather than used and unused inodes. If you want to perform a full
integrity check on all data in a pool, scrub it.
The nice thing about hash-based integrity checking is that it catches all
sorts of errors, even unexpected ones. Remember, happy filesystems are all
alike; every unhappy filesystem is unhappy in its own way.
Scrubbing ZFS
A scrub of a ZFS pool verifies the cryptographic hash of every data block
in the pool. If the scrub detects an error, it repairs the error if sufficient
resiliency exists. Scrubs happen while the pool is online and in use.
If your pool has identified any data errors, they’ll show up in the zpool’s
status. If you’ve run a scrub before, you’ll also see that information in the
scan line.
...
scan: scrub repaired 0 in 15h57m with 0 errors on Sun Fe
…
errors: No known data errors
...
This pool has encountered no errors in the data it has accessed. If it had
found errors, it would have self-healed them. The pool hasn’t checked all
the data for errors, however—it has checked only the data it’s been asked
for. To methodically search the entire pool for errors, use a scrub. Run
zpool scrub and the pool name.
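For example, to scrub the zroot pool shown earlier:
# zpool scrub zroot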
Scrubs run in the background. You can see how they’re doing by
running zpool status.
# zpool status
...
scan: scrub in progress since Tue Feb 24 [Link] 2015
12.8G scanned out of 17.3G at 23.0M/s, 0h3m to go
0 repaired, 74.08% done
...
A ZFS pool scrubbing its storage runs more slowly than usual. If your
system is already pushing its performance limits, scrub pools only during
off-peak hours. If you must cancel an ongoing scrub, run zpool scrub -s.
Be sure to go back and have the system complete its scrub as soon as
possible.
Scrub Frequency
ZFS’ built-in integrity testing and resiliency mean that most errors are
fixable, provided that they’re found early enough for the resiliency to kick
in. This means that your hardware’s quality dictates how often you should
scrub a host’s pools. If you have reliable hardware, such as so-called
“server grade” gear, scrubbing quarterly should suffice. If you’re abusing
cheap hardware, you should scrub every month or so.
FreeBSD can perform regular scrubs for you, as discussed in “ZFS
Maintenance Automation” later this chapter.
Pool Properties
ZFS uses properties to express a pool’s characteristics. While zpool
properties look and work much like a dataset’s properties, and many
properties seem to overlap between the two, dataset properties have no
relationship to pool properties. Pool properties include facts such as the
pool’s health, size, capacity, and per-pool features.
A pool’s properties affect the entire pool. If you want to set a property
for only part of a pool, check for a per-dataset property that fits your needs.
Viewing Pool Properties
To view all the properties of all the pools on your system, run zpool get
all. You can add a pool name to the end if you want only the properties on
a specific pool. Here we look at the properties for the pool zroot.
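That looks like this:
# zpool get all zroot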
The first two columns give the pool name and the name of the property.
The third column lists the value of the property. This can be something
like enabled or disabled, on or off, active or inactive, or it can be a value.
This pool’s size property is 920G—this pool has 920 GB of space.
The SOURCE column shows where this property is set. This can be a
single dash, or the words default or local. A dash means that this property
isn’t set per se, but rather somehow read from the pool. You don’t set the
value for the pool’s size or how much of that space is used. FreeBSD
calculates those values from the pool. A SOURCE of default indicates that
this property is set to its default value, while local means that this property
has been specifically set on this pool.
To get a single property, run zpool get with the property name.
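For example, to check a single property such as the pool's comment:
# zpool get comment zroot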
We locally set the comment to the default value, so the value’s source
remains local.
You can set a pool’s properties at creation time with -o. You can set
properties for the root dataset on that pool with -O.
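A creation command using both flags might look like this; the pool name and provider are examples, while the two properties match the ones described below:
# zpool create -o altroot=/mnt -O canmount=off mypool gpt/zfs0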
The pool has its altroot property set to /mnt, and the root dataset on
this pool has the canmount property set to off. If a property changes how
data is written, only data written after changing the property is affected.
ZFS won’t rewrite existing data to comply with a property change.
Pool History
Every zpool retains a copy of all changes that have ever been made to the
pool, all the way back to the pool’s creation. This history doesn’t include
routine events like system power-on and power-off, but it does include
setting properties, pool upgrades, and dataset creation.
To access the history, run zpool history and give the pool name.
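For example:
# zpool history zroot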
If you want more detailed information on your pools, the daily periodic(8)
report can also include zpool list output. Set daily_status_zfs_zpool_list_enable
to YES to get the list. If you want to trim that output, showing only the status of
specific pools, list the desired pools in the daily_status_zfs_zpool_list_pools
variable.
You can also have FreeBSD perform your pool scrubs. With the
scrubbing options set, FreeBSD performs a daily check to see if the pool
needs scrubbing, but only scrubs at configured intervals. To automatically
scrub every pool every 35 days, set daily_scrub_zfs_enable to YES in
/etc/periodic.conf.
daily_scrub_zfs_enable="YES"
FreeBSD defaults to scrubbing all pools. You can’t explicitly exclude
specific pools from the daily scrub check. You can, however, explicitly list
the pools you want checked in daily_scrub_zfs_pools. Any pool not listed
isn’t scrubbed.
daily_scrub_zfs_pools="zroot prod test"
To change the number of days between scrubs, set
daily_scrub_zfs_default_threshold to the desired number of days.
daily_scrub_zfs_default_threshold="10"
If you want to scrub a specific pool on a different schedule, set
daily_scrub_zfs_${poolname}_threshold to the desired number of days.
Here we scrub the pool prod every 7 days.
daily_scrub_zfs_prod_threshold="7"
Any pool without its own personal threshold uses the default threshold.
Removing Pools
To get rid of a pool, use the zpool destroy command and the pool name.
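For example, to destroy the compost pool created earlier:
# zpool destroy compost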
Pool features that are enabled are available for use, but not actually used
yet. Your system might support a new type of compression, but has not
actually written any data to the pool using the new algorithm. This pool
could be imported on a system that doesn’t support the feature, because the
on-disk format has not changed to accommodate the feature. The new host
won’t see anything that makes it freak out.
Disabled pool features are available in the operating system but not
enabled. Nothing in the pool says that these features are available—the
presence of disabled features means they’re available in the operating
system. This pool is definitely usable on hosts that don’t support this
feature.
If the feature is active, the on-disk format has changed because the
feature is in use. Most commonly, this pool cannot be imported onto a
system that doesn’t support this feature. If the feature is active, but all
datasets using the feature are destroyed, the pool reverts the feature setting
to enabled.
A few features are “read-only compatible.” If the feature is in active
use, the pool could be partially imported onto a system that doesn’t support
the feature. The new host might not see some datasets on the pool, and it
can’t write any data to the pool, but it might be able to extract some data
from the datasets.
Creating a pool enables all features supported by that operating system’s
ZFS implementation. You could use the -d flag with zpool create to
disable all features in a new pool and then enable features more selectively.
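A sketch of that approach, using a hypothetical pool and one of the features listed by zpool upgrade -v:
# zpool create -d mypool gpt/zfs0
# zpool set feature@lz4_compress=enabled mypool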
Now that you understand how pools work, let’s put some actual data on
them.
1 ZFS follows the Unix tradition of not preventing you from doing daft things, because that
would also prevent you from doing clever things.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 420M 17.9G 96K none
mypool/ROOT 418M 17.9G 96K none
mypool/ROOT/default 418M 17.9G 418M /
...
Restrict the type of dataset shown with the -t flag and the type. You
can show filesystems, volumes, or snapshots. Here we display snapshots,
and only snapshots.
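For example:
# zfs list -t snapshot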
Creating a filesystem dataset takes only zfs create and the full dataset name,
as in zfs create mypool/lamb. This creates a new dataset, lamb, on the ZFS pool
called mypool. If the pool has a default mount point, the new dataset is mounted
by default (see "Mounting ZFS Filesystems" later this chapter).
Zvols show up in a dataset list like any other dataset. You can tell zfs list to show only volumes with -t volume. Each zvol also gets a device node under /dev/zvol.
# ls -al /dev/zvol/mypool/avolume
crw-r----- 1 root operator 0x4d Mar 27 20:22 /dev/zvol/mypool/avolume
You can run newfs(8) on this device node, copy a disk image to it, and
generally use it like any other block device.
Renaming Datasets
You can rename a dataset with, oddly enough, the zfs rename command.
Give the dataset’s current name as the first argument and the new location
as the second.
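For example, to rename the lamb dataset used elsewhere in this chapter (the new name is ours):
# zfs rename mypool/lamb mypool/sheep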
Use the -f flag to forcibly rename the dataset. You cannot unmount a
filesystem with processes running in it, but the -f flag gleefully forces the
unmount. Any process using the dataset loses access to whatever it was
using, and reacts however it will.1
Moving Datasets
You can move a dataset from part of the ZFS tree to another, making the
dataset a child of its new parent. This may cause many of the dataset’s
properties to change, since children inherit properties from their parent.
Any properties set specifically on the dataset will not change.
Here we move a database out from under the zroot/var/db dataset, to a
new parent where you have set some properties to improve fault tolerance.
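A sketch of such a move, with hypothetical dataset names:
# zfs rename zroot/var/db/mysql zroot/safe/mysql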
Note that since mount points are inherited, this will likely change the
dataset’s mount point. Adding the -u flag to the rename command will
cause ZFS not to immediately change the mount point, giving you time to
reset the property to the intended value. Remember that if the machine is
restarted, or the dataset is manually remounted, it will use its new mount
point.
You can rename a snapshot, but you cannot move snapshots out of their
parent dataset. Snapshots are covered in detail in Chapter 7.
Destroying Datasets
Sick of that dataset? Drag it out behind the barn and put it out of your
misery with zfs destroy.
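For example, with a hypothetical dataset:
# zfs destroy db/olddata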
If you add the -r flag, you recursively destroy all children (datasets,
snapshots, etc.) of the dataset. To destroy any cloned datasets while you’re
at it, use -R. Be very careful recursively destroying datasets, as you can
frequently be surprised by what, exactly, is a child of a dataset.
You might use the -v and -n flags to see exactly what will happen
when you destroy a dataset. The -v flag prints verbose information about
what gets destroyed, while -n tells zfs(8) to perform a dry run. Between
the two, they show what this command would actually destroy before you
pull the trigger.
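Combining them looks like this, using a dataset from earlier examples:
# zfs destroy -rvn mypool/lamb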
ZFS Properties
ZFS datasets have a number of settings, called properties, that control how
the dataset works. While you can set a few of these only when you create
the dataset, most of them are tunable while the dataset is live. ZFS also
offers a number of read-only properties that provide information such as the
amount of space consumed by the dataset, the compression or
deduplication ratios, and the creation time of the dataset.
Each dataset inherits its properties from its parent, unless the property is
specifically set on that dataset.
Viewing Properties
The zfs(8) tool can retrieve a specific property, or all properties for a
dataset. Use the zfs get command, the desired property, and if desired, a
dataset name.
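For example, checking compression on the lamb dataset might produce output along these lines (the values shown are illustrative):
# zfs get compression mypool/lamb
NAME PROPERTY VALUE SOURCE
mypool/lamb compression lz4 inherited from mypool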
Under NAME we see the dataset you asked about, and PROPERTY
shows the property you requested. The VALUE is what the property is set
to.
The SOURCE is a little more complicated. A source of default means
that this property is set to ZFS’ default. A local source means that someone
deliberately set this property on this dataset. A temporary property was set
when the dataset was mounted, and this property reverts to its usual value
when the dataset is unmounted. An inherited property comes from a parent
dataset, as discussed in “Parent/Child Relationships” later in this chapter.
Some properties have no source because the source is either irrelevant
or inherently obvious. The creation property, which records the date and
time the dataset was created, has no source. The value came from the
system clock.
If you don’t specify a dataset name, zfs get shows the value of this
property for all datasets. The special property keyword all retrieves all of a
dataset’s properties.
If you use all and don’t give a dataset name, you get all the properties
for all datasets. This is a lot of information.
Show multiple properties by separating the property names with
commas.
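For example, picking a few of the properties covered later this chapter:
# zfs get atime,exec,setuid mypool/lamb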
You can also view properties with zfs list and the -o modifier. This
is most suited for when you want to view several properties from multiple
datasets. Use the special property name to show the dataset’s name.
You can also add a dataset name to see these properties in this format
for that dataset.
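Something like this, again using a few of the filesystem properties covered below:
# zfs list -o name,atime,exec mypool/lamb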
Changing Properties
Change properties with the zfs set command. Give the property name, the
new setting, and the dataset name. Here we change the compression
property to off.
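With the lamb dataset as an example:
# zfs set compression=off mypool/lamb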
Most properties apply only to data written after the property is changed.
The compression property tells ZFS to compress data before writing it to
disk. We talk about compression in Chapter 6. Disabling compression
doesn’t uncompress any data written before the change was made.
Similarly, enabling compression doesn’t magically compress data already
on the disk. To get the full benefit of enabling compression, you must
rewrite every file. You’re better off creating a new dataset, copying the data
over with zfs send, and destroying the original dataset.
Read-Only Properties
ZFS uses read-only properties to offer basic information about the dataset.
Disk space usage is expressed as properties. You can’t change how much
data you’re using by changing the property that says “your disk is half-
full.” (Chapter 6 covers ZFS disk space usage.) The creation property
records when this dataset was created. You can change many read-only
properties by adding or removing data to the disk, but you can’t write these
properties directly.
Filesystem Properties
One key tool for managing the performance and behavior of traditional
filesystems is mount options. You can mount traditional filesystems read-
only, or use the noexec flag to disable running programs from them. ZFS
uses properties to achieve the same effects. Here are the properties used to
accomplish these familiar goals.
atime
A file’s atime indicates when the file was last accessed. ZFS’ atime
property controls whether the dataset tracks access times. The default
value, on, updates the file’s atime metadata every time the file is accessed.
Using atime means writing to the disk every time it’s read.
Turning this property off avoids writing to the disk when you read a file,
and can result in significant performance gains. It might confuse mailers
and other similar utilities that depend on being able to determine when a
file was last read.
Leaving atime on also increases snapshot size. The first time a file is
accessed after a snapshot, its atime is updated. The snapshot retains the original access
time, while the live filesystem stores the newly updated access time.
Leaving atime on is the default.
exec
The exec property determines if anyone can run binaries and commands on
this filesystem. The default is on, which permits execution. Some
environments don’t permit users to execute programs from their personal or
temporary directories. Set the exec property to off to disable execution of
programs on the filesystem.
The exec property doesn’t prohibit people from running interpreted
scripts, however. If a user can run /bin/sh, they can run /bin/sh /home/
mydir/script.sh. The shell is what's actually executing—it only takes
instructions from the script.
readonly
If you don’t want anything writing to this dataset, set the readonly
property to on. The default, off, lets users modify the dataset within
administrative permissions.
setuid
Many people consider setuid programs risky.2 While some setuid programs
must be setuid, such as passwd(1) and login(1), there’s rarely a need to
have setuid programs on filesystems like /home and /tmp. Many sysadmins
disallow setuid programs except on specific filesystems.
ZFS’ setuid property toggles setuid support. If set to on, the filesystem
supports setuid. If set to off, the setuid flag is ignored.
User-Defined Properties
ZFS properties are great, and you can’t get enough of them, right? Well,
start adding your own. The ability to store your own metadata along with
your datasets lets you develop whole new realms of automation. The fact
that children automatically inherit these properties makes life even easier.
To make sure your custom properties remain yours, and don’t conflict
with other people’s custom properties, create a namespace. Most people
prefix their custom properties with an organizational identifier and a colon.
For example, FreeBSD-specific properties have the format
"org.freebsd:propertyname," such as org.freebsd:swap. If the illumos
project creates its own property named swap, they'd call it
org.illumos:swap. The two values won't collide.
For example, suppose Jude wants to control which datasets get backed
up via a dataset property. He creates the namespace com.allanjude.3
Within that namespace, he creates the property backup_ignore.
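Setting it on a dataset looks like any other property assignment (the dataset name here is an example):
# zfs set com.allanjude:backup_ignore=on mypool/lamb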
Jude’s backup script checks the value of this property. If it’s set to true,
the backup process skips this dataset.
Parent/Child Relationships
Datasets inherit properties from their parent datasets. When you set a
property on a dataset, that property applies to that dataset and all of its
children. For convenience, you can run zfs(8) commands on a dataset and
all of its children by adding the -r flag. Here, we query the compression
property on a dataset and all of its children.
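For example:
# zfs get -r compression mypool/lamb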
Look at the source values. The first dataset, mypool/lamb, inherited this
property from the parent pool. In the second dataset, this property has a
different value. The source is local, meaning that the property was set
specifically on this dataset.
We can restore the original setting with the zfs inherit command.
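For the child dataset discussed here, that would be something like:
# zfs inherit compression mypool/lamb/baby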
The child now inherits the compression properties from the parent,
which inherits from the grandparent.
When you change a parent’s properties, the new properties
automatically propagate down to the child.
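For example, setting gzip-9 compression on the parent, as the next paragraph assumes:
# zfs set compression=gzip-9 mypool/lamb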
Our baby dataset uses gzip-9 compression. It’s inherited this property
from mypool/lamb. Now let’s move baby to be a child of second, and see
what happens to the compression property.
# zfs rename mypool/lamb/baby mypool/second/baby
# zfs get -r compression mypool/second
NAME PROPERTY VALUE SOURCE
mypool/second compression lz4 inherited from mypool
mypool/second/baby compression lz4 inherited from mypool
The child dataset now belongs to a different parent, and inherits its
properties from the new parent. The child keeps any local properties.
Data on the baby dataset is a bit of a tangle, however. Data written
before compression was turned on is uncompressed. Data written while the
dataset used gzip-9 compression is compressed with gzip-9. Any data
written now will be compressed with lz4. ZFS sorts all this out for you
automatically, but thinking about it does make one’s head hurt.
Removing Properties
While you can set a property back to its default value, it’s not obvious how
to change the source back to inherit or default, or how to remove custom
properties once they’re set.
To remove a custom property, inherit it.
This works even if you set the property on the root dataset.
To reset a property to its default value on a dataset and all its children,
or totally remove custom properties, use the zfs inherit command on the
pool’s root dataset.
# zfs mount
zroot/ROOT/default /
zroot/tmp /tmp
zroot/usr/home /usr/home
zroot/usr/ports /usr/ports
zroot/usr/src /usr/src
...
We have all sorts of datasets under /usr, but there’s no /usr dataset
mounted. What’s going on?
A zfs list shows that the dataset exists, and it has a mount point of
/usr. But let's check the mountpoint and canmount properties of zroot/usr.
With canmount set to off, the zroot/usr dataset is never mounted. Any
files written in /usr, such as the commands in /usr/bin and the packages
in /usr/local, go into the root filesystem. Lower-level mount points such
as /usr/src have their own datasets, which are mounted.
The dataset exists only to be a parent to the child datasets. You’ll see
something similar with the /var partitions.
Multiple Datasets with the Same Mount Point
Setting canmount to off allows datasets to be used solely as a mechanism to
inherit properties. One reason to set canmount to off is to have two datasets
with the same mount point, so that the children of both datasets appear in
the same directory, but might have different inherited characteristics.
FreeBSD’s installer does not have a mountpoint on the default pool,
zroot. When you create a new dataset, you must assign a mount point to it.
If you don’t want to assign a mount point to every dataset you create
right under the pool, you might assign a mountpoint of / to the zroot pool
and leave canmount set to off. This way, when you create a new dataset, it
has a mountpoint to inherit. This is a very simple example of using
multiple datasets with the same mount point.
Imagine you want an /opt directory with two sets of subdirectories.
Some of these directories contain programs, and should never be written to
after installation. Other directories contain data. You must lock down the
ability to run programs at the filesystem level.
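Suppose we create two parent datasets on the pool db for this, one for programs and one for data (the names are ours):
# zfs create db/programs
# zfs create db/data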
Now give both of these datasets the mountpoint of /opt and tell them
that they cannot be mounted.
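Each property goes to both datasets:
# zfs set canmount=off db/programs db/data
# zfs set mountpoint=/opt db/programs db/data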
You can’t run programs from the db/data dataset, so turn off exec and
setuid. We need to write data to these directories, however.
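That means something like:
# zfs set exec=off db/data
# zfs set setuid=off db/data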
We now have four datasets mounted inside /opt, two for binaries and
two for data. As far as users know, these are normal directories. No matter
what the file permissions say, though, nobody can write to two of these
directories. Regardless of what trickery people pull, the system won’t
recognize executables and setuid files in the other two. When you need
another dataset for data or programs, create it as a child of the dataset with
the desired settings. Changes to the parent datasets propagate immediately
to all the children.
Pools without Mount Points
While a pool is normally mounted at a directory named after the pool, that
isn’t necessarily so.
This pool no longer gets mounted. Neither does any dataset on the pool
unless you specify a mount point. This is how the FreeBSD installer creates
the pool for the OS.
Now you can mount this dataset with the mount(8) command:
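For example, with a hypothetical dataset and mount point:
# mount -t zfs scratch/junk /mnt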
You can also add ZFS datasets to the system’s /etc/fstab. Use the full
dataset name as the device node. Set the type to zfs. You can use the
standard filesystem options of noatime, noexec, readonly or ro, and
nosuid. (You could also explicitly give the default behaviors of atime,
exec, rw, and suid, but these are ZFS’ defaults.) The mount order is
normal, but the fsck field is ignored. Here’s an /etc/fstab entry that
mounts the dataset scratch/junk nosuid at /tmp.
scratch/junk /tmp zfs rw,nosuid 2 0
We recommend using ZFS properties to manage your mounts, however.
Properties can do almost everything /etc/fstab does, and more.
Tweaking ZFS Volumes
Zvols are pretty straightforward—here’s a chunk of space as a block
device; use it. You can adjust how a volume uses space and what kind of
device node it offers.
Space Reservations
The volsize property of a zvol specifies the volume’s logical size. By
default, creating a volume reserves an amount of space for the dataset equal
to the volume size. (If you look ahead to Chapter 6, it establishes a
refreservation of equal size.) Changing volsize changes the reservation.
The volsize can only be set to a multiple of the volblocksize property,
and cannot be zero.
Without the reservation, the volume could run out of space, resulting in
undefined behavior or data corruption, depending on how the volume is
used. These effects can also occur when the volume size is changed while it
is in use, particularly when shrinking the size. Adjusting the volume size
can confuse applications using the block device.
Zvols also support sparse volumes, also known as thin provisioning. A
sparse volume is a volume where the reservation is less than the volume
size. Essentially, using a sparse volume permits allocating more space than
the dataset has available. With sparse provisioning you could, say, create
ten 1 TB sparse volumes on your 5 TB dataset. So long as your volumes
are never heavily used, nobody will notice that you’re overcommitted.
Sparse volumes are not recommended. Writes to a sparse volume can
fail with an “out of space” error even if the volume itself looks only
partially full.
Specify a sparse volume at creation time by specifying the -s option to
the zfs create -V command. Changes to volsize are not reflected in the
reservation. You can also reduce the reservation after the volume has been
created.
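A sketch, creating a 1 TB sparse volume on the hypothetical pool mypool:
# zfs create -s -V 1T mypool/sparsevol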
Zvol Mode
FreeBSD normally exposes zvols to the operating system as geom(4)
providers, giving them maximum flexibility. You can change this with the
volmode property.
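For example, to expose a volume only as a plain device node rather than a full GEOM provider (volume name from the earlier example):
# zfs set volmode=dev mypool/avolume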
Now every block is stored twice. If one of the copies becomes corrupt,
ZFS can still read your file. It knows which of the blocks is corrupt because
its checksums won’t match. But look at the space use on the pool (the
REFER space in the pool listing).
Only the 10 MB we wrote were used. No extra copy was made of this
file, as you wrote it before changing the copies property.
With copies set to 2, however, if we either write another file or
overwrite the original file, we’ll see different disk usage.
The total space usage is 30 MB, 10 for the first file of random data, and
20 for 2 copies of the second 10 MB file.
When we look at the files with ls(1), they only show the actual size:
# ls -l /lamb/random*
-rw-r--r-- 1 root wheel 10485760 Apr 6 15:27 /lamb/r
-rw-r--r-- 1 root wheel 10485760 Apr 6 15:29 /lamb/r
1 Probably badly.
2 Properly written setuid programs are not risky. That’s why real setuid programs are risky.
3 When you name ZFS properties after yourself, you are immortalized by your work.
Whether this is good or bad depends on your work.
Chapter 5: Repairs & Renovations
Disks fill up. That’s what they’re for. Hardware fails for the same reason.
Sometimes you must take disks from one machine and put them in another,
or replace a failed hard drive, or give your database more space. This
chapter discusses how you can modify, update, and repair your storage
pools.
Before we get into that, let’s discuss how ZFS rebuilds damaged
VDEVs.
Resilvering
Virtual devices such as mirrors and RAID-Z are specifically designed to
reconstruct missing data on damaged disks. If a disk in your mirror pair
dies, you replace the disk and ZFS will copy the surviving mirror onto the
new one. If a disk in your RAID-Z VDEV fails, you replace the broken
drive and ZFS rebuilds that disk from parity data. This sort of data recovery
is a core feature of every RAID implementation.
ZFS understands both the filesystem and the underlying storage,
however. This gives ZFS latitude and advantages that traditional RAID
managers lack.
Rebuilding a disk mirrored by software or hardware RAID requires
copying every single sector from the good disk onto the replacement. The
RAID unit must copy the partition table, the filesystem, all the inodes, all
the blocks (even the free space), and all the data from one to the other.
We’ve all made a typo in /etc/[Link] that prevented a system from
booting. Fixing that typo on a system mirrored with UFS2 and gmirror(8)
required booting into single-user mode, fixing the typo, and rebooting. This
made one of the disks out of sync with the other. At the reboot, FreeBSD
noticed the discrepancy and brought the backup disk into sync by copying
every single sector of the current drive onto the backup. You might have
changed one or two sectors on the disk, but gmirror(8) had to copy the
whole thing. This might take hours or days.
ZFS knows precisely how much of each disk is in use. When ZFS
reassembles a replacement storage provider, it copies only the data actually
needed on that provider. If you replace a ZFS disk that was only one-third
data, ZFS copies only that one-third of a disk of data to the replacement.
Fixing an rc.conf typo on a ZFS-mirrored disk requires sysadmin
intervention very similar to that needed on a gmirror(8) system. You get
into single-user mode. You fix the typo. You reboot. The difference is, ZFS
knows exactly which blocks changed on the disk. If only one of the disks
was powered on during single user mode (unlikely, but it could happen),
the two disks would be out of sync. Rather than try to copy the entire disk,
ZFS updates only the blocks needed to resynchronize the disks. The system
will probably repair the mirror before you can type a command to see how
it’s doing.
ZFS reconstruction is called resilvering. Like other ZFS integrity
operations, resilvering takes place only on live filesystems. You could
resilver in single-user mode, but it makes as much sense as installing
software in single-user mode.
Resilvering happens automatically when you replace a storage provider.
It also happens when a drive temporarily fails and is restored, such as when
a controller restarts or an external disk shelf reboots. While resilvering a
replacement storage provider can take quite a while, resilvering after a brief
outage probably takes only seconds.
If you use a RAID-Z pool normally while resilvering, resilvering can
greatly slow down. Resilvering and scrubbing are performed in order by
transaction groups, while normal read-write operations are pretty random.
ZFS’ resilver rate is throttled so that it won’t impact normal system
function.
Expanding Pools
Data expands to fill all available space. No matter how much disk space
you give a pool, eventually you’ll want more. To increase a pool’s size, add
a VDEV to the pool. For redundant pools, you can replace storage
providers with larger providers.
When you expand a pool, ZFS automatically starts writing data to the
new space. As the pool ages, ZFS tries to evenly balance available space
between the various providers. ZFS biases the writes to the drives so that
they will all become full simultaneously. A pool with one empty VDEV
and three nearly full ones has little choice but to put new data on the empty
VDEV, however. If you frequently create and delete files, per-disk load
eventually levels out.
Every VDEV within a zpool should be identical. If your pool is built
from a bunch of mirrors, don’t go adding a RAID-Z3 to the pool.
Add providers to VDEVs with the zpool attach command and VDEVs
to pools with the zpool add command.
You can’t remove a device from a non-mirror VDEV or any VDEV
from a pool. The –n flag to zpool add performs a “dry run,” showing you
the results of what running the command would be without actually
changing the pool. Running your zpool add command with the –n flag and
carefully studying the resulting pool configuration can give you warning
you’re about to shoot yourself in the foot.
Adding VDEVs to Striped Pools
Striped pools, with no redundancy, can be expanded up to the limits of the
hardware. Each non-redundant VDEV you add to a pool increases the odds
of a catastrophic failure, however, exactly like the RAID-0 device it
resembles. Remember, the failure of a single VDEV in a pool destroys the
entire pool. In a striped pool, each disk is a standalone VDEV.
Here’s a striped pool with three providers.
Use the zpool add command to add a storage provider to the scratch
pool.
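With a hypothetical fourth provider:
# zpool add scratch gpt/zfs3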
The pool status now shows four storage providers, and you have your
additional disk space.
Adding VDEVs to Mirror Pools
You can add providers to a mirrored VDEV, but extra disks don’t increase
the available space. They become additional mirrors of each other. To add
space to a pool that uses mirrored VDEVs, add a new mirror VDEV to the
pool.
The zpool db currently has two mirror VDEVs in it.
# zpool status db
...
NAME STATE READ WRITE CKSUM
db ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/zfs0 ONLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
gpt/zfs2 ONLINE 0 0 0
gpt/zfs3 ONLINE 0 0 0
We need more space, so we want to add a third mirror VDEV. Use the
zpool add command to create a new mirror device and add it to the pool.
Here we use the providers gpt/zfs4 and gpt/zfs5 to create a new mirror
and add it to the pool.
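That command looks like this:
# zpool add db mirror gpt/zfs4 gpt/zfs5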
The pool’s status now shows a new mirror VDEV, mirror-2, containing
two storage providers. As you write and delete data, the pool gradually
shifts load among all three VDEVs. To view how a pool currently
distributes data between the VDEVs, use zpool list -v.
Again, we use the zpool add command to create a new VDEV and add
it to the pool.
# zpool status
pool: FreeNAS02
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: [Link]
scan: scrub repaired 0 in 15h57m with 0 errors on Sun
config:
A RAID-Z2 VDEV can withstand two disk failures, and will continue to function despite the missing drives.
A degraded pool has limited self-healing abilities, however. A pool
without redundancy does not have the information necessary for ZFS to
repair files. Our sample pool above has lost two disks out of its RAID-Z2
VDEV. It has zero redundancy. If a file suffers from bit rot, ZFS can’t fix it.
When you try to access that file, ZFS returns an error. Redundancy at the
dataset layer (with the copies property) might let ZFS heal the file.
If this pool experiences another drive failure, the pool will no longer
have a complete copy of its data and will fault.
Restoring Devices
If ZFS is kind enough to announce its problems, the least you can do is try
to fix them. The repair process depends on whether the drive is missing or
failed.
Missing Drives
A drive disconnected during operation shows up as either removed or
faulted. Maybe you removed a drive to check its serial number. Perhaps a
cable came loose. It might have been gremlins. In any case, you probably
want to plug it back in.
If the hardware notices that the drive is removed, rather than just saying
it’s missing, the hardware also notices when the drive returns. ZFS attempts
to reactivate restored drives.
Hardware that doesn’t notify the operating system when drives are
added or removed needs sysadmin intervention to restore service. Use the
zpool online command to bring a reconnected drive back into service.
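For example, with a hypothetical pool and provider:
# zpool online db gpt/zfs4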
If the drive is offline because it’s failed, though, you must replace it
rather than just turn it back on.
Replacing Drives
The hardest part of drive replacement often has nothing to do with ZFS:
you must find the bad drive. We advise using the physical location of the
disk in the GPT label for the disk when you first install the drive to make
later replacement easier. If you must identify a failed drive without this
information, use gpart list and smartctl to get the disk’s serial number
and manufacturer, then search the chassis for that drive. It’s the same
process discussed in Chapter 0, in reverse, with the added pressure of
unscheduled downtime. Worst case, you can find the serial number of every
drive that is still working, and process of elimination will reveal which
drive is missing.
Now don’t you wish you’d done the work in advance?
Once you find the failed drive and arrange its replacement, that’s where
we can start to use ZFS.
Faulted Drives
Use the command zpool replace to remove a drive from a resilient VDEV
and swap a new drive in. The drive doesn’t have to be failed—it could be a
perfectly healthy drive that you want to replace so that you can, say, do
maintenance on the disk shelf. Here’s a RAID-Z1 pool with a bad drive.
Use zpool replace to swap it with a new device. Give the pool name, the failed provider, and the new
provider.
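A sketch, with hypothetical provider names:
# zpool replace db gpt/zfs1 gpt/zfs5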
This command might take a long time, depending on the disk’s capacity
and speed and the amount of data on the disk. You can view the status of
the replacement by checking the pool’s status.
# zpool status db
pool: db
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Mar 16 [Link] 2
195M scanned out of 254M at 19.5M/s, 0h0m to go
47.3M resilvered, 76.56% done
config:
The resilvering time estimates given assume that disk activity is fairly
constant. Starting a big database dump halfway through the resilvering
process delays everything.
Replacing the Same Slot
Perhaps your hard drive array is full, and you don’t have the space to slot in
a new hard drive with a new device node. You must physically remove the
failed hard drive, mount the replacement in its space, partition and label the
drive, and replace the provider. That’s only slightly more complex.
This method has more risks, however. With zpool replace, the faulted
provider remains as online as it can manage until resilvering finishes. If you
lose a second disk in your RAID-Z1 during resilvering, there’s a chance the
pool has enough data integrity to survive. If you replace the faulty provider
before starting the rebuild, you lose that safety line. If your hardware
doesn’t give you the flexibility you need for a safer replacement, though,
check your backups and proceed.
Start by taking the failed provider offline. This tells ZFS to stop trying
to read or write to the device.
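With hypothetical names:
# zpool offline db gpt/zfs1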
Here we attach a provider to the pool db. One of the existing providers
is gpt/zfs1, and we’re attaching gpt/zfs2. Look at zpool status db and
you’ll see the pool resilvering to synchronize the new provider with the
other disks in the mirror. Once the new provider is synchronized with the
pool, remove the failing provider from the virtual device.
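The sequence would run something like this, assuming gpt/zfs0 is the failing provider:
# zpool attach db gpt/zfs1 gpt/zfs2
Once resilvering completes, detach the failing provider:
# zpool detach db gpt/zfs0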
ZFS will resilver the reactivated drive and resume normal function.
Log and Cache Device Maintenance
We advise using high-endurance SSD drives for your ZFS Intent Log
(write cache) and L2ARC (read cache). All too often you’ll find that “high
endurance” is not the same as “high enough endurance,” and you might
need to replace the device. Log devices use the same status keywords as
regular storage providers—faulted, offline, and so on. You might also need
to insert a log device or, less commonly, remove the log device.
While the examples show log devices, cache devices work exactly the
same way.
Adding a Log or Cache Device
To add a log or cache device to an existing pool, use zpool add, the pool
name, and the device type and providers. Here we add the log device gpt/
zlog0 to the pool db.
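The command:
# zpool add db log gpt/zlog0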
The pool immediately begins using the new log or cache device.
To add a mirrored log device, use the mirror keyword and the providers.
Mirroring the ZIL provides redundancy for writes, helping guarantee that
data written to disk survives a hardware failure. Here we mirror the devices
gpt/zlog0 and gpt/zlog1 and tell the pool db to use the mirror as a log
device.
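That looks like this:
# zpool add db log mirror gpt/zlog0 gpt/zlog1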
Most often, a mirrored cache device isn’t a good use of fast disk. ZFS
handles the death of a cache device fairly well. Striping the cache across
multiple devices reduces load on any single device and hence reduces the
chance of failure.
Removing Log and Cache Devices
When you remove a log or cache device from a ZFS pool, ZFS stops
writing new data to the log, clears out the buffer of data from the log, and
releases the device.
To remove a standalone log or cache device, use zpool remove, the pool
name, and the device name. We previously added the device gpt/zlog0 as
a log device for the pool db. Let’s remove it.
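The removal:
# zpool remove db gpt/zlog0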
# zpool status db
...
NAME STATE READ WRITE CKSUM
db ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/zfs0 ONLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
gpt/zfs2 ONLINE 0 0 0
gpt/zfs3 ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
gpt/zlog0 ONLINE 0 0 0
gpt/zlog1 ONLINE 0 0 0
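To remove a mirrored log, give zpool remove the mirror's name as shown in the status output, something like:
# zpool remove db mirror-2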
The pool clears the log and removes the device from the pool.
Replacing Failed Log and Cache Devices
Replace a failed log or cache device, even a mirror member, exactly as you
would any other failed devices. Here we replace the device gpt/zlog0 with
gpt/zlog2.
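The command:
# zpool replace db gpt/zlog0 gpt/zlog2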
To detach a pool from the system so you can move its disks elsewhere,
export it with zpool export and the pool name. This command should run
silently. Run zpool list to verify the pool is no longer on the system.
The system will refuse to export an active filesystem. Shut down any
daemons writing to the dataset and change your shell’s working directory
away from the dataset. Stop tailing files. You can use fstat(1) or lsof(8)
to identify processes using filesystems on that dataset.
Importing Pools
To see inactive pools attached to a system, run zpool import. This doesn’t
actually import any pools, but only shows what’s available for import.
# zpool import
pool: db
id: 8407636206040904802
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
db ONLINE
raidz1-0 ONLINE
gpt/zfs1 ONLINE
gpt/zfs2 ONLINE
gpt/zfs3 ONLINE
gpt/zfs4 ONLINE
This shows that the pool db, also known by a long numerical identifier,
can be imported. You see the pool configuration at the bottom exactly as
you would for an active pool.
The status ONLINE does not mean that the pool is active, but rather that
the providers are all ready for use. As far as ZFS knows, this pool is ready
to go.
Import the pool with zpool import and the pool name or numerical ID.
# zpool import db
If you have multiple inactive pools with the same name, import the pool
by ID number instead.
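Using the identifier from the listing above:
# zpool import 8407636206040904802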
You cannot import a pool if a pool of that name already exists, unless
you rename the pool.
Renaming Imported Pools
Some of us reuse pool names between machines. When Lucas needs a
dedicated pool for a database he always calls it db, because it’s short and
he’s lazy. This is great for standardization—everyone knows exactly where
the database files live. It’s an annoyance when moving disks to another
machine, though. Each machine can have only one pool of each name.
ZFS lets you permanently rename a pool by giving the new name after
the existing name. Here we import the pool called db under the name
olddb.
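The command:
# zpool import db olddb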
Datasets from the imported pool can be found in /olddb. These renames
are permanent. You can export and reimport the pool with its new name
forever.
To temporarily mount a pool at a location other than its usual mount
point, use the -R flag and an alternate root path.
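For example, using the path mentioned below:
# zpool import -R /dunno db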
This temporarily adds the path /dunno to all datasets in the imported
pool. Exporting the pool removes the extra path and unsets the altroot
property.
Use the altroot property when you don’t know what’s in a pool and
you don’t want to chance overlaying it on your existing datasets or
filesystems. Remember, BSD filesystems are stackable! You can also use it
in an alternate boot environment, where the imported pool might overlay
the running root filesystem and hide the tools you need to manage the pool.
Incomplete Pools
You can’t import a pool if it doesn’t have enough members to provide all
the needed data. Much as you can’t use a RAID-Z1 if it’s missing two
disks, you can’t import a RAID-Z1 with more than one missing disk.
# zpool import
pool: db
id: 8407636206040904802
state: UNAVAIL
status: One or more devices are missing from the system
action: The pool cannot be imported. Attach the missing
devices and try again.
see: [Link]
config:
A destroyed pool can often be brought back, so long as its providers haven't been reused. List destroyed pools with the -D flag.
# zpool import -D
The pool's status will show up as ONLINE (DESTROYED). The
ONLINE means that the pool has everything it needs to function. Use the -D
flag again, with the pool name, to import the destroyed pool.
If a pool is missing too many storage providers, you cannot import it.
You cannot zpool online detached drives. Check the drive trays and make
sure the drives you want to import are attached and powered on. The next
time you run zpool import, reconnected drives will show up.
If a pool is missing its log device, add the -m flag to import it without
that device. An exported pool should have everything on the storage
providers.
# zpool import -m db
You can set pool properties when you import, by using the -o flag. Here
we import and rename a database pool, and also make it read-only.
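Something like:
# zpool import -o readonly=on db olddb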
We can now copy files from the old pool without damaging the pristine
copy of the data.
You might want to import a damaged pool, to try to recover some part
of the data on it. The -F flag tells zpool import to roll back the last few
transactions. This might return the pool to an importable state. You’ll lose
the contents of the rolled back transactions, but if this works, those
transactions were probably causing your problems anyway.
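For example:
# zpool import -F db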
Larger Providers
One interesting fact about ZFS is that it permits replacing providers with
larger providers. If your redundant storage pool uses 4 TB disks you can
replace them with, say, 10 TB models and increase the size of your pool.
This requires replacing successive providers with larger ones.
A pool calculates its size by the smallest disk in each VDEV. If your
mirror has a 4 TB disk and a 10 TB disk in a single VDEV, the mirror
VDEV will only have 4 TB of space. There’s no sensible way to mirror 10
TB of data on a 4 TB disk! If you replace the 4 TB disk, however, you’ll be
able to expand the mirror to the size of the smallest disk.
One question to ask is: do you want your pools to automatically expand
when they can, or do you want to manually activate the expansion? ZFS
can automatically make the expansion work, but you need to set the
autoexpand property for each pool before starting. ZFS leaves this off by default.
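For example:
# zpool set autoexpand=on db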
# zpool list db
NAME SIZE ALLOC FREE FRAG EXPANDSZ CAP DEDUP HEALTH ALTROOT
db 59.5G 1.43G 58.1G 1% - 2% 1.00x ONLINE -
If the hardware has enough physical space, add new drives and create
replacement providers. If you’re short on physical space, offline the
providers and replace the hard drives. Here we offline and replace the
drives.
This pool has three providers: gpt/zfs1, gpt/zfs2, and gpt/zfs3. We
first replace gpt/zfs1. Running gpart show -l shows that this provider is
on drive da1.
If you need to offline the drive to add the replacement drive, start by
identifying the physical location of drive da1. Prepare the replacement
drive as required by your hardware, then take the old provider offline.
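Taking it offline is one command:
# zpool offline db gpt/zfs1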
This should return silently. Checking zpool status shows this provider
is offline. You can remove this hard drive from the system.
Insert the replacement drive, either in the space where the old drive was
removed or a new slot. The new drive should appear in
/var/run/dmesg.boot. On this system, the new drive shows up as /dev/da4. Create
the desired partitioning on that drive and label it. If you’re not using serial
numbers in your labels, but labeling only by physical location, you can use
the same label. (Again, we use these short labels here because they’re
easier to read while learning.)
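A sketch of the partitioning and replacement, assuming the new disk is da4 and we reuse the zfs1 label as described:
# gpart create -s gpt da4
# gpart add -t freebsd-zfs -a 1m -l zfs1 da4
# zpool replace db gpt/zfs1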
Let the pool finish resilvering before replacing any other providers.
Replacing a non-redundant unit during a resilvering will only cause pain. If
you’re using RAID-Z2 or RAID-Z3 it is possible to replace multiple disks
simultaneously, but it’s risky. An additional disk failure might make the
VDEV fail. Without the redundancy provided by the additional providers,
ZFS cannot heal itself. Each disk’s I/O limits will probably throttle
resilvering speed.
After your first provider resilvers, swap out your next smaller provider.
You will see no change in disk space until you swap out every provider in
the VDEV. To be sure you've replaced every provider with a larger one,
check zpool list.
# zpool list db
NAME SIZE ALLOC FREE FRAG EXPANDSZ CAP DEDUP HEALTH ALTROOT
db 59.5G 1.70G 57.8G 0% 240G 2% 1.00x ONLINE -
Note that we now have new space in EXPANDSZ. This pool can be
grown.
If you set the pool to autoexpand before you started, it should grow on
its own. If not, manually expand each device in the pool with zpool online
-e.
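For example, expanding every provider in the db pool:
# zpool online -e db gpt/zfs1 gpt/zfs2 gpt/zfs3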
# zpool upgrade -v
This system supports ZFS pool feature flags.
The following features are supported:
FEAT DESCRIPTION
--------------------------------------------------------
async_destroy (read-only compatible)
Destroy filesystems asynchronously.
empty_bpobj (read-only compatible)
Snapshots use less space.
lz4_compress
LZ4 compression algorithm support.
...
The features marked “read-only compatible” mean that hosts that don’t
support these feature flags can import these pools, but only as read-only.
See “Pool Import and Export” earlier in this chapter for a discussion of
moving pools between hosts.
The FreeBSD release notes for each version indicate new ZFS features.
You do read the release notes carefully before upgrading, don’t you? If you
somehow miss that part of the documentation, zpool status tells you
which pools could use an upgrade. (Remember, just because a pool can
take an upgrade doesn’t mean that you should do the upgrade. If you might
need to revert an operating system upgrade, leave your pool features
alone!)
# zpool status db
pool: db
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
...
You’ll also get a list of the new features supported by the upgrade.
Upgrade your pools by running zpool upgrade on the pool.
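For example:
# zpool upgrade db
The command reports which features it enabled.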
Pool upgrades non-reversibly add new fields to the existing pool layout.
An upgrade doesn’t rewrite existing data, however. While the new feature
might have problems, the mere availability of that feature flag on the disk is
very low risk.
If you plan to move disks to a system running an older operating
system, or to an operating system running an older version of OpenZFS,
you can enable pool features more selectively. Moving disks from a
FreeBSD 11 system to a FreeBSD 10 system requires carefully checking
pool features. Enable a single feature by setting its property to enabled.
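For example, to enable only the LZ4 compression feature shown earlier:
# zpool set feature@lz4_compress=enabled db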
If the pool you upgrade is the one the system boots from, you must also install updated boot code on the pool's disks. The system might not boot without this update. The zpool upgrade
command prints a reminder to perform the update, but you’re free to ignore
it if you like. If you render your system unbootable, you might try to boot
from the newest FreeBSD-current ISO or a live CD and copy its boot
loader to your system.
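Updating the boot code before you reboot avoids that scramble. On a GPT-partitioned boot disk it looks something like this; the disk name and the index of the freebsd-boot partition here are only examples, so check yours with gpart show first.
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0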
FreeBSD ZFS Pool Limitations
FreeBSD does not yet support all ZFS features. Most unsupported features
don’t work because of fundamental differences between the FreeBSD and
Solaris architectures. People are actively developing solutions that will let
FreeBSD support all of ZFS’ features. We expect some of these to become
supported after this book goes to press.
At this time, hot spares do not work. Hot spares let ZFS automatically
swap a failed drive with an assigned standby on the system. This depends
on the forthcoming zfsd(8) implementation, which is still a work in
progress.
Now that you can fill a pool with data and repair the hardware, let’s play
with a couple of ZFS’ more useful features, clones and snapshots.
1 And now that Lucas has a good example of a problem, he can tell that friend that this
zpool is wounded. Although, to be certain he has good examples, he’ll probably wait until
he finishes this book. Being Lucas’ friend kind of sucks.
Chapter 6: Disk Space Management
ZFS makes it easy to answer questions like “How much free disk does this
pool have?” The question “What’s using up all my space?” is much harder
to answer thanks to snapshots, clones, reservations, ZVOLs, and referenced
data. These features might also cause problems when trying to use
traditional filesystem management methods on ZFS datasets. It’s possible
to find you don’t have enough space to delete files, which is terribly
confusing until you understand what’s happening.
ZFS offers ways to improve disk space utilization. Rather than requiring
the system administrator to compress individual files, ZFS can use
compression at the filesystem level. ZFS can also perform deduplication of
files, vastly improving disk usage at the cost of memory. We’ll see how to
evaluate these options and determine when to use each.
But let’s start by considering ZFS’ disk space usage.
Reading ZFS Disk Usage
The df(1) program shows the amount of free space on each partition,
while du(1) shows how much disk is used in a partition or directory tree.
For decades, sysadmins have used these tools to see what’s eating their free
space. They’re great for traditional filesystems. ZFS requires different ways
of thinking, however.
Consider this (heavily trimmed) list of ZFS datasets.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zroot 17.5G 874G 144K none
zroot/ROOT 1.35G 874G 144K none
zroot/ROOT/default 1.35G 874G 1.35G /
zroot/usr 12.5G 874G 454M /usr
zroot/usr/local 1.84G 874G 1.84G /usr/local
...
According to this, the zroot pool has 17.5 GB in use. At first glance
you might think that zroot/ROOT and zroot/ROOT/default both use 1.35
GB. You’d be wrong.
The dataset zroot/ROOT uses 1.35 GB of data. There’s 1.35 GB of data
in this dataset. The dataset zroot/ROOT/default also uses 1.35 GB of data.
The zroot/ROOT/default dataset is included in the zroot/ROOT dataset,
however. It’s the same 1.35 GB of data.
Similarly, consider the 12.5 GB that zroot/usr uses. This dataset has
child datasets, such as zroot/usr/local, zroot/usr/obj, and so on. Each
of these datasets uses a chunk of data, often several gigabytes. The 12.5 GB
that zroot/usr uses includes everything beneath it.
With ZFS, you can’t just add up the amount of used space to get the
total.
The AVAIL column, or space available, is somewhat more reliable. The
pool zroot has 874 GB of empty space. Once you start using snapshots and
clones and all of the other ZFS goodness, you’ll find that this 874 GB of
space can contain many times that much data, thanks to referenced data.
Referenced Data
The amount of data included in a dataset is the referenced data. Look at the
REFER column in the listing above. The zroot pool and zroot/ROOT both
refer to 144 KB of space. That’s roughly enough to say that “yes, this
chunk of stuff exists.” It’s a placeholder. The dataset zroot/ROOT/default,
however, references 1.35 GB of data.
The referenced data is stuff that exists within this filesystem or dataset.
If you go into the zroot/ROOT/default filesystem, you’ll find 1.35 GB of
stuff.
So, you add up the referenced space and get the amount used? No,
wrong again. Multiple ZFS datasets can refer to the same collection of data.
That’s exactly how snapshots and clones work. That’s why ZFS can hold,
for example, several 10 GB snapshots in 11 GB of space.
Clones use space much like snapshots, except in a more dynamic
manner. Once you add in deduplication and compression, disk space usage
gets complicated really quickly.
And then there are even issues around freeing space.
Freeing Space
In many ZFS deployments, deleting files doesn’t actually free up space. In
most situations, deletions actually increase disk space usage by a tiny
amount, thanks to snapshots and metadata. The space used by those files
gets assigned to the most recent snapshot. To successfully manage ZFS,
you have to understand how the underlying features work and what ZFS
does when you delete data.
On a filesystem using snapshots and clones, newly freed space doesn’t
appear immediately. Many ZFS operations free space asynchronously, as
ZFS updates all the blocks that refer to that space. The pool’s freeing
property shows how much space ZFS still needs to reclaim from the pool.
If you free up a whole bunch of space at once, you can watch the freeing
property decrease and the free space increase. How quickly ZFS reclaims
space depends on your hardware, the amount of load, pool design,
fragmentation level, and how the space was used.
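You can watch the property with zpool get; the value here is illustrative.
# zpool get freeing zroot
NAME   PROPERTY  VALUE  SOURCE
zroot  freeing   0      -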
Asynchronously freeing space is easily understood: you look at the
freeing property and see how quickly it goes down. But to the uninitiated,
ZFS’ disk use can seem much weirder. Suppose you have a bunch of
dataset snapshots, and their parent dataset gets full. (We cover snapshots in
Chapter 7, but bear with us now.) You delete a couple of large ISOs from
the dataset. Deleting those files won’t free up any space. Why not?
Those ISO files still exist in the older snapshots. ZFS knows that the
files don’t exist on the current dataset, but if you go look in the snapshot
directories you’ll see those files. ZFS must keep copies of those deleted
files for the older snapshots as long as the snapshots refer to that data.
Snapshot contents are read-only, so the only way to remove those large
files is to remove the snapshots. If you have multiple snapshots, disk usage
gets more complex. And clones (Chapter 7), built on snapshots, behave the
same way.
Understanding precisely how a particular dataset uses disk space
requires spending some time with its properties.
Disk Space Detail
To see exactly where your disk space went, ask zfs list for more detail
on space usage with the -o space modifier.
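Here's what that looks like for a single dataset, using the zroot/usr figures discussed below:
# zfs list -o space zroot/usr
NAME       AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
zroot/usr   874G  12.5G         0    454M              0      12.0G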
The AVAIL column shows the amount of space available to each of the
datasets on this pool. ZFS shares the available space amongst all of the
datasets in the pool. This is taken from the ZFS property available. We
show how to limit usage with quotas and reservations later in this chapter.
The USED column shows the amount of space taken up by this dataset
and everything descended from it. Snapshots, ZVOLs, clones, regular files,
and anything else that uses space counts against this amount. This value
might lag behind changes for a few seconds as ZFS writes new files,
creates snapshots and child datasets, or makes other changes. The value
comes from the dataset’s used property.
The USEDSNAP column shows the amount of space used
exclusively by snapshots. When you first snapshot a dataset, the snapshot
takes almost no space, because it’s nearly identical to the original dataset.
As the dataset changes, however, the snapshots grow. As multiple
snapshots of the same dataset probably refer to the same data, it’s difficult
to say if deleting a single snapshot will free up any part of this space.
Completely removing all of this dataset’s snapshots will certainly free up
this amount of space, however. This value comes from the dataset’s
usedbysnapshots property. Chapter 7 discusses snapshots.
The USEDDS column shows the amount of space used by files on this
dataset. It excludes snapshots, reservations, child datasets, or other special
ZFS features. It comes from the dataset’s usedbydataset property. Chapter
4 covers datasets.
Under USEDREFRESERV you'll see the space used by a
refreservation for this dataset, excluding its children. This value comes
from the dataset's usedbyrefreservation property. See “Reservations and
Quotas” later in this chapter.
The USEDCHILD column shows how much space this dataset’s
children use, as shown by the usedbychildren property.
Compare the entry for zroot/usr in zfs list from the previous section
to the detailed space description. The zfs list result says that the dataset
uses 12.5 GB and refers to 454 MB. By breaking the space-specific list into
different categories, it’s very clear that this dataset uses 454 MB, and the
child datasets take up 12 GB.
Use zfs list -o space whenever you investigate disk usage.
Pool Space Usage
Sometimes you don’t care about the space usage of individual datasets.
Only the space available to the pool as a whole matters. If you look at a
pool’s properties, you’ll see properties that look an awful lot like the
amount of space used and free. They are, but a pool’s space properties
include space required for parity information. They don’t reflect the amount
of data you can fit on the pool.
If you have a mirror or a striped pool, the pool space information is
fairly close to reality. If you’re using RAID-Z1, you’ll lose one provider of
space to parity per virtual device in the pool. RAID-Z2 costs two disks per
VDEV, and RAID-Z3 costs three disks per VDEV. While you could, in
theory, use these properties, the pool’s current usage, and a bit of math to
get a good guess as to how much space you’re using, there’s an easier way:
ask zfs(8) about the pool’s root dataset.
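For example, the pool's root dataset repeats the figures we saw at the start of this chapter:
# zfs list zroot
NAME   USED  AVAIL  REFER  MOUNTPOINT
zroot  17.5G  874G   144K  none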
Traditional tools such as df(1) are not merely less than optimal; they're actively
incorrect and give wrong or confusing answers for ZFS. We're going to use
df(1) as an example, but many other tools have similar problems.
The df(1) tool walks through each mounted filesystem and reports the size of the
partition, how much of that space is currently used, and how much is
remaining as free space.
Walking through the filesystems doesn’t work for ZFS, because datasets
are not filesystems. ZFS exposes each dataset to the operating system as if
it were a separate filesystem. A dataset does not have a maximum size
(unless it has a quota). While you can set upper and lower limits on the size
of a dataset, all ZFS datasets have access to all of the available free space
in the pool.
To offer some semblance of compatibility with the traditional df(1)
tool, ZFS tells a little white lie. Since a ZFS dataset has no “size” in the
traditional filesystem sense, ZFS adds the dataset's used space and the entire pool's
available free space together, and presents that total as the “size” of the
dataset.
Look at our example pools in the previous section. The zroot/ROOT/
default dataset uses 1.35 GB of space, and has 874 GB free. The total size
is 875.35 GB. Then look at the zroot/usr dataset. It has used 12.5 GB, and
has 874 GB free, for a total of 886.5 GB.
Now check some actual df(1) output for these datasets.
# df -h
Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/default 875G 1.4G 874G 0% /
zroot/usr 874G 454M 874G 0% /usr
...
The root filesystem is 875 GB, and /usr is 874 GB, giving these two
partitions a total of 1749 GB, with 1748 GB free. Pretty impressive for a 1
TB disk, isn’t it? The “Capacity” column that showed what percentage of
the filesystem is in use is similarly bogus.
As datasets grow larger, the amount of free space shrinks. According to
df(1), the filesystem shrinks as space is used up, and grows when space is
freed.
Tools like df(1), and most other monitoring tools intended for
traditional filesystems, give incorrect answers. Beware of them! While they
might seem fine for a quick check, continuing to use these tools further
ingrains bad habits. Bad systems administration habits cause pain and
outages. When monitoring a dataset’s free space, make sure you are
measuring the actual amount of free space, rather than the percentage used.
If a traditional tool that shows “percent used” gives a meaningful result, it’s
only by accident. Your monitoring system needs ZFS-specific tools.
This behavior has interesting side effects when viewed with other tools
meant for traditional filesystems. A ZFS dataset mounted via Samba on a
Windows machine will show only a minuscule amount of space used, and
the remaining amount of space in the pool as the free space. As the pool
fills with data, Windows sees the drive shrink.
When using ZFS, develop the habit of managing it with accurate tools.
Limiting Dataset Size
ZFS’ flexibility means that users and applications can use disk space if it’s
available. This is very valuable, especially on long-lived systems, but
sometimes that’s the exact behavior you don’t want. You don’t want
datasets like /var/log expanding to fill your disk and, conversely, you want
to be certain that critical datasets like your database get the space they
need. If the main database runs out of space because Jude temporarily
stashed his collection of illicit potted fern photos in his home directory,
you’ll have an unpleasant and unnecessary meeting.1 That’s where quotas
and reservations come in.
A quota dictates the maximum amount of space a dataset and all its
descendants can use up. If you set a quota of 100 GB on the dataset
mounted as /home, the total amount of space used by /home and all the
datasets and snapshots beneath it cannot exceed 100 GB.
A reservation dictates an amount of space set aside for a dataset. To
ensure that a database always has room to write its files, use a reservation
to carve out an amount of disk space just for that dataset.
We’ll discuss reservations first, then proceed to quotas.
Reservations
A reservation sets aside a chunk of disk space for a dataset and its children.
The system will not allow other datasets to use that space. If a pool runs out
of space, the reserved space will still be available for writes to that dataset.
Suppose we reserve 100 GB out of a 1 TB pool for /var/db, where our
database stuffs its data files. This dataset has about 50 GB of data in it. A
log file runs amok, and fills the rest of the pool. We’ll get errors from the
other programs on the system saying that the disk is full—but the database
program will still have free space in /var/db. It might complain that it
can’t write program logs to /var/log/db, but that’s a separate issue.
ZFS manages reservations with two ZFS properties: refreservation
and reservation. A refreservation affects the dataset’s referenced data—
that is, it excludes snapshots, clones, and other descendants. A reservation
includes child datasets, snapshots, and so on. For an example, look at this
snippet from zfs list.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zroot/usr 12.5G 874G 454M /usr
zroot/usr/local 1.84G 874G 1.84G /usr/local
zroot/usr/obj 6.89G 874G 6.89G /usr/obj
zroot/usr/ports 1.97G 874G 816M /usr/ports
…
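The reservations described next would be set something like this, assuming the database logs live in zroot/var/log/db and the other logs in zroot/var/log:
# zfs set refreservation=10G zroot/var/log/db
# zfs set reservation=20G zroot/var/log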
No matter what, this host will now have 10 GB reserved for database
logs and 20 GB for the log files, including the database directory. We used
a refreservation for the database logs because we don’t want snapshots
counted against that reservation.
To remove a reservation, set it to none.
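For example:
# zfs set reservation=none zroot/var/log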
Quotas
A quota caps the total amount of space a dataset and all of its descendants can consume. Set it with the quota property, as sketched below. With a 100 GB quota on the log dataset, the log files can use no more than 100 GB, including snapshots, child
datasets, and everything else.
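A sketch of setting such a quota, again assuming the logs live in zroot/var/log:
# zfs set quota=100G zroot/var/log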
You can separately limit the amount of referenced data with a refquota.
This excludes child datasets and snapshots. Limiting both the size of the
entire dataset and the dataset’s referenced data can help you control the size
of your snapshots. For example, setting a refquota of 10 GB and a quota of
100 GB would tell you that you could always have at least nine snapshots even if the
data completely changes. Similarly, if you want to exclude child datasets,
use a refquota.
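Suppose we want two logging datasets to each reference no more than 50 GB, with a 500 GB cap on the whole works. The dataset names here are only examples.
# zfs set refquota=50G mypool/log/app1
# zfs set refquota=50G mypool/log/app2
# zfs set quota=500G mypool/log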
Here we have separate refquotas for two logging datasets, and a quota
for both of the datasets together. If each dataset can reference up to 50 GB
on its own, the 500 GB quota means that no matter how the data changes,
you can have at least four snapshots of each.
Viewing Quotas
To see the quotas on a dataset, check the quota and refquota properties.
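For example, on the home directory dataset:
# zfs get quota,refquota zroot/home
NAME        PROPERTY  VALUE  SOURCE
zroot/home  quota     none   default
zroot/home  refquota  none   default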
The /home directory has no quotas on it. Users may fill your hard drive
to its limits.
Quotas change the dataset’s maximum size and the free space in the
dataset. This pool has several hundred gigabytes free, but zfs list on this
dataset says otherwise.
ZFS knows that the parent dataset has a quota of 100 GB, and therefore
also sets that maximum size on the child datasets. If /var/log has 75 GB
free, and /var/log/db has 85 GB free, does that mean that these two
partitions have (75 + 85 =) 160 GB of free space? No, because like free
space in a pool, these two entries both refer to the same free space. The
dataset zroot/var/log/db entry seems to have more free space because
data in its parent dataset is not reflected in the child dataset’s usage.
Exceeded Quotas
If a user or process attempts to write something that would make the
dataset exceed its quota, it will get a quota error.
# cp [Link] [Link]
cp: [Link]: Disc quota exceeded
You’ll need to free some space, but remember that snapshots might
complicate that, as discussed in “Freeing Space” earlier in this chapter. If
you’ve set both a quota and a refquota, the user might be able to delete files
and free up space even though that increases the size of the filesystem’s
snapshots.
User and Group Quotas
User and group quotas control how much data a user or a group can write
to a dataset. Like dataset quotas, user and group quotas are controlled on a
per-dataset basis.
User and group quotas don’t apply to child filesystems, snapshots, and
clones. You must apply quotas to each individual dataset you want them to
affect.
Viewing Space Used and Existing Quotas per Dataset
The zfs userspace command lets you see how much space is used by
each user in a dataset. Here we examine the zroot/home dataset on our test
system. A system with complicated datasets might need several minutes to
run du(1), but zfs userspace finds all the files owned by each user nearly
instantaneously.
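The output looks something like this:
# zfs userspace zroot/home
TYPE        NAME     USED   QUOTA
POSIX User  179      7.29M  none
POSIX User  mwlucas  1.16G  none
POSIX User  root     298M   none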
The user mwlucas has 1.16 GB of files—unsurprising. The root user has
298 MB of files in /home—somewhat surprising, but not shocking.
Somehow, though, user 179 has 7.29 MB of files in that dataset. This
system has no user 179, which is why the user is shown by UID rather than
username. A bit of digging shows that Lucas once used tar’s -p argument
when extracting a source code tarball, preserving the original file
ownership.
None of these users have quotas.
The zfs groupspace command shows how much space files owned by
each group use. For something more interesting, I’m checking the group
ownerships on the zroot/usr/local dataset.
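Assign a quota by setting the userquota@username property on the dataset; something like this would cap the mwlucas account at 1 GB:
# zfs set userquota@mwlucas=1G zroot/home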
The previous section showed that the mwlucas account had over a
gigabyte of data in it. The mwlucas account is over quota, and that user gets
an error whenever he tries to create a file.
$ touch test
touch: test: Disc quota exceeded
If a user has repeatedly abused shared directories like /tmp, assign them
a restrictive quota.
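Something like this would do it, assuming /tmp is its own dataset; the user name here is only an example.
# zfs set userquota@jude=10M zroot/tmp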
This user can use features like SSH agent forwarding, but he can’t
extract huge tarballs and monopolize the shared temporary space.
To remove a quota, set the quota to none.
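For example:
# zfs set userquota@mwlucas=none zroot/home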
Viewing Individual Quotas
If you’re interested in the quota set for a specific user or group, ask ZFS for
that one property.
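For example:
# zfs get userquota@mwlucas zroot/home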
Now you can let your teams squabble among themselves over their disk
space usage, without taking up your precious time. Congratulations!
ZFS Compression
You can’t increase the size of an existing disk, but you can change how
your data uses that disk. For decades, sysadmins have compressed files to
make them take up less space. We’ve written all kinds of shell scripts to
run our preferred compression algorithm on the files we know can be safely
compressed, and we’re always looking for additional files that can be
compressed to save space. And we all know about that previously unknown
log file that expands until it fills the partition and trips an alarm.3
ZFS takes away that problem by compressing files in real time, at the
filesystem level. Those log files your daemon writes? ZFS can compress
them as they’re written, rendering all those scripts irrelevant. This also
amortizes the cost of compression as the system compresses everything on
an ongoing basis rather than in a 3 AM frenzy of disk thrashing.
Compression imposes costs, however. Compression and decompression
require CPU time, so blindly enabling the tightest gzip compression
everywhere can add another constraint on disk performance. Any
performance losses are most often more than made up by the reduction in
disk activity, however. ZFS includes compression algorithms specifically
designed for filesystem use.
Enabling Compression
ZFS compression works on a per-dataset basis. You can enable
compression for some datasets but not others.
Enable and disable compression with the compression property. Here
we check the compression setting.
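The dataset name here is only an example:
# zfs get compression zroot/var/log
NAME           PROPERTY     VALUE  SOURCE
zroot/var/log  compression  off    default
# zfs set compression=lz4 zroot/var/log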
ZFS compresses files when the files are written to disk. If you have a
dataset full of text files, adding compression won’t make them shrink. To
reduce disk space used by files, you must rewrite all the files after enabling
compression.
Compression Algorithms
ZFS supports several compression algorithms. The default, LZJB, was
specifically designed for filesystem use. It can quickly compress and
decompress blocks with a modest compression ratio. It’s not the best
compression algorithm for routine use, however.
The LZ4 algorithm is a newer and faster filesystem-specific compression
algorithm. It outperforms LZJB in all ways. Not all data is compressible,
but LZ4 quickly detects incompressible files and doesn’t try to compress
them. When you enable compression for a dataset, use LZ4 unless you
have a specific use case for gzip compression.
The ZLE algorithm compresses strings of zeroes in files. It’s a minimal
compression system, and isn’t terribly useful for most files. LZ4 is far more
effective than ZLE, even on files with many runs of zeroes.
For special cases, ZFS supports gzip compression. Gzip uses much more
CPU time than LZ4, but can be more effective for some datasets. The
additional CPU time gzip requires makes the filesystem slower, but for data
that’s not accessed frequently the disk space savings might be worthwhile.
Gzip has nine compression levels, from 1 (the fastest but least effective)
to 9 (the slowest but most aggressive). Specify a gzip compression level
with a dash and the level.
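For example, to use the most aggressive level on an archive dataset (dataset name illustrative):
# zfs set compression=gzip-9 mypool/archive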
If you specify gzip without a level, ZFS uses the gzip default of level 6.
Compression Properties
Several properties offer insight into how well ZFS compresses your data.
The compressratio property shows how compression has affected this
dataset and all its children, while the refcompressratio property allows
you to see how compression has impacted this dataset’s referenced data.
Datasets have two properties just for compression scenarios,
logicalreferenced and logicalused. A dataset's referenced space includes the effects of compression, while the logical properties show how much space the data would take up uncompressed. For example, here's a roughly 274 MB file sitting on a dataset where we've just enabled the default LZJB compression.
# du [Link]
280721 [Link]
The file size hasn’t changed, but we enabled compression. What’s going
on? Remember, compression, deduplication, and similar features work only
on files written after the feature is enabled. We must remove the file and
put it back.
# rm /db/*
# cp /home/mwl/[Link] /db
Wait a few seconds so that ZFS can write everything to disk, and see
what happens.
# du /db/[Link]
139577 /db/[Link]
The file uses only 139,577 blocks, or about 136 MB. It's shrunk to
roughly half its original size, as the dataset properties show.
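Now switch the dataset over to LZ4; we're assuming here that the dataset is the pool's root dataset, db.
# zfs set compression=lz4 db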
Recopy the file to trigger LZ4 compression, wait a few seconds for ZFS
to do its accounting, and see what happens.
# du /db/[Link]
146076 /db/[Link]
LZ4 compresses this data to 142 MB. LZ4 is not as effective as LZJB on
this particular file. That’s not terribly shocking—different algorithms work
differently on different data.
Would gzip improve things further?
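Switching the dataset to gzip is one more property change; the compression level here is only an example.
# zfs set compression=gzip-9 db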
Re-copy the test file to the dataset and check the disk usage.
# du /db/[Link]
74104 /db/[Link]
This data now uses about 72 MB, and the dataset now has a
compressratio of 3.78. Gzip is clearly a better match for this particular
data. Compression almost quadrupled our effective disk space. While that’s
fairly impressive, let’s turn up the volume.
This dataset uses 48.7 MB of disk space. When you ignore the
compression, the dataset has 220 MB of data. A compressed dataset can
store more “logical data” than its size.
Here’s where the effectiveness of compression really comes into play.
The slowest part of reading and writing data is getting it on the storage
media. The physical media is the slowest part of a disk transaction. Writing
48.7 MB to disk takes about 22% as long as writing 220 MB. You can cut
your storage times by 78% by enabling compression, at the cost of a little
CPU time. If your disk can write 100 MB/s, then writing that 48.7 MB of
compressed data will take about half a second. If you look at it from the
perspective of the application that wrote the data, you actually wrote 220
MB in half a second, effectively 440 MB/s. We bet you didn’t think your
laptop disk could manage that!
If you are storing many small files, compression is less effective. Files
smaller than the sector size get a whole block allocated to them anyway. If
you want really, really effective compression, use a disk with actual 512-
byte physical sectors and tell ZFS to use that sector size.
Compression isn’t perfect. Sequential and random access can change
how well compression performs. Always test with your own data, in your
environment. Compression works well enough that FreeBSD enables lz4
compression in its default install.
Most CPUs are mostly idle. Make the lazy critters crunch some data!
Deactivating Compression
To deactivate compression, set the dataset’s compression property to off.
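For example:
# zfs set compression=off db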
Much as activating compression only affects newly written files,
deactivating compression only affects new data. Compressed files remain
compressed until rewritten. ZFS is smart enough to know that a file is
compressed and to automatically decompress it when accessed, but it still
has the overhead.
You cannot purge all traces of compression from a dataset except by
rewriting all the files. You’re probably better off recreating the dataset.
Deduplication
Files repeat the same data over and over again, arranged slightly
differently. Multiple files contain even more repetition. More than half of
the data on your system might be duplicates of data found elsewhere. ZFS
can identify duplicate data in your files, extract and document it, thus
storing each piece of data only once. It’s very similar to compression.
Deduplication can reduce disk use in certain cases.
Many deduplication systems exist. At one extreme, you could
deduplicate all data on a byte-by-byte level. You could deduplicate this
book by identifying and recording the position of each letter and
punctuation mark, but the record would grow larger than the actual book.
At the other extreme, you could deduplicate multiple copies of entire files
by recording each only once.
ZFS snapshots could be said to deduplicate filesystem data. For
deduplicating files, ZFS deduplicates at the filesystem block level (shown
by the recordsize property). This makes ZFS good at removing duplicates
of identical files, but it realizes that files are duplicates only if their
filesystem blocks line up exactly. Using smaller blocks improves how well
deduplication works, but increases memory requirements. ZFS stores
identical blocks only once and stores the deduplication table in memory.
Enable deduplication on a dataset-by-dataset basis. Every time any file
on a deduplicated dataset is accessed by either reading or writing, the
system must consult the deduplication table. For efficient deduplication, the
system must have enough memory to hold the entire deduplication table.
ZFS stores the deduplication table on disk, but if the host must consult the
on-disk copy every time it wants to access a file, performance will slow to a
drag. (A host must read the dedup table from disk at boot, so you’ll get disk
thrashing at every reboot anyway.)
While deduplication sounds incredibly cool, you must know how well
your data can deduplicate and how much memory deduplication requires
before you even consider enabling it.
Deduplication Memory Needs
For a rough-and-dirty approximation, you can assume that 1 TB of
deduplicated data uses about 5 GB of RAM. You can more closely
approximate memory needs for your particular data by looking at your data
pool and doing some math. We recommend always doing the math and
computing how much RAM your data needs, then using the most
pessimistic result. If the math gives you a number above 5 GB, use your
math. If not, assume 5 GB per terabyte.
If you short your system on RAM, performance will plummet like Wile
E. Coyote.5 Don’t do that to yourself.
Each filesystem block on a deduplicated dataset uses about 320 bytes of
RAM. ZFS’ zdb(8) tool can analyze a pool to see how many blocks would
be in use. Use the -b flag and the name of the pool you want to analyze.
# zdb -b data
Traversing all blocks to verify nothing leaked ...
bp count: 139025
ganged count: 0
bp logical:    18083569152    avg: 130074
bp physical:   18070658560    avg: 129981    compression: 1.00
bp allocated:  18076997120    avg: 130026    compression: 1.00
bp deduped:    0    ref>1: 0    deduplication: 1.00
SPA allocated: 18076997120 used: 1.81%
The “bp count” shows the total number of ZFS blocks stored in the
pool. This pool uses 139,025 blocks. While ZFS uses a maximum block
size of 128 KB by default, small files use smaller blocks. If a pool has
many small files, you’ll need more memory.
In the third line from the bottom, the “used” entry shows that this pool is
1.81% (or 0.0181) used. Assume that the data in this pool will remain fairly
consistent as it grows. Round up the number of used blocks to 140,000.
Divide the used blocks by how full the pool is, and we see that the full
pool will have about (140,000 / 0.0181 = ) 7,734,806 blocks. At 320 bytes
per block, this data uses 2,475,138,121 bytes of RAM, or roughly 2.3 GB.
That’s less than half the rule of thumb. Assume that the ZFS
deduplication table on this pool will need 5 GB of RAM per terabyte of
storage.
ZFS lets metadata like the deduplication table take up only 25% of the
system’s memory. (Actually, it’s 25% of the Adaptive Replacement Cache,
or ARC, but we’ll go into detail on that in FreeBSD Mastery: Advanced
ZFS.) Each terabyte of deduplicated pool means that the system needs at
least 20 GB of RAM. Even if you go with your more hopeful math based
on block usage, where each terabyte of disk needs 2.3 GB of RAM, the
25% limit means that each terabyte of deduplicated pool needs about 10
GB of RAM. (In FreeBSD Mastery: Advanced ZFS, we discuss adjusting
this limit so that people who want to shoot themselves in the foot can do it
well.)
Deduplication Effectiveness
ZFS can simulate deduplication and provide a good estimate on how well
the data would deduplicate. Run zdb -S on your pool. You’ll get a nice
histogram of block utilization and common elements, which you can
completely ignore in favor of the last line.
# zdb -S data
Simulated DDT histogram:
...
dedup = 3.68, compress = 1.00, copies = 1.00, dedup * compress / copies = 3.68
Our pool data can be deduplicated 3.68 times. If all the data in this pool
were this deduplicatable, we could fit 3.68 TB of data in each terabyte of
storage. This data is exceptionally redundant, however. For comparison, on
Lucas’ desktop, the zroot pool that contains the FreeBSD operating
system, user programs, and home directories, is about 1.06 deduplicatable.
That’s not bad. We still need a machine with 20 GB of RAM per
terabyte of deduplicated pool, mind you, but we can now make a cost/
benefit calculation based on the current cost of hardware. You can also
compare your test data’s deduplicatability with its compressibility.
Is the memory expense worth it? That depends on the cost of memory
versus the cost of storage.
Every time we’ve assessed our data for deduplicatability and
compressibility, and then priced hardware for each situation, we’ve found
that enhancing compression with faster disks and more CPU was more
cost-effective than loads of memory for deduplication. Deduplication does
not improve disk read speed, although it can improve cache hit rates. It
only increases write speed when it finds a duplicate block. Deduplication
also significantly increases the amount of time needed to free blocks, so
destroying datasets and snapshots can become incredibly slow.
Compression affects everything without imposing these penalties.
Deduplication probably only makes sense when disk space is
constrained, expensive, and very high performance. If you need to cram
lots of deduplicable data onto a pool of SSDs, dedup might be for you.
Everyone’s data is different, however, so check yours before making a
decision.
Enabling Deduplication
The ZFS property dedup activates and deactivates deduplication.
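Turning it on looks like this; the dataset name is only an example.
# zfs set dedup=on data/vm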
1 On the plus side, you’ll have an excuse to throw Jude under the bus. Metaphorically, if
you prefer.
2 Sysadmins who consider “being fair to users” outside their normal remit can use
refquotas as a way of reducing exposure to user cooties.
3 You don’t monitor disk space usage? Well, an outage is merely a different sort of alarm.
4 Yes, Mr. Pedantic, your real-world data is composed only of ones and zeroes. Go
compress your data down to a single 0 and a 1 and see how well that works for you.
5 Also like Wile E. Coyote, painting a tunnel on the wall won’t help.
Chapter 7: Snapshots and Clones
One of ZFS’ most powerful features is snapshots. A filesystem or zvol
snapshot allows you to access a copy of the dataset as it existed at a precise
moment. Snapshots are read-only, and never change. A snapshot is a frozen
image of your files, which you can access at your leisure. While backups
normally capture a system across a period of minutes or hours, running
backups on a snapshot means that the backup gets a single consistent
system image, eliminating those tar: file changed as we read it
messages and its cousins.
While snapshots are read-only, you can roll the dataset back to the
snapshot’s state. Take a snapshot before upgrading a system, and if the
upgrade goes horribly wrong, you can fall back to the snapshot and yell at
your vendor.
Snapshots are the root of many special ZFS features, such as clones. A
clone is a fork of a filesystem based on a snapshot. New clones take no
additional space, as they share all of their dataset blocks with the snapshot.
As you alter the clone, ZFS allocates new storage to accommodate the
changes. This lets you spin up several slightly different copies of a dataset
without using a full ration of disk space for each. You want to know that
your test environment tightly mirrors the production one? Clone your
production filesystem and test on the clone.
Snapshots also underpin replication, letting you send datasets from one
host to another.
Best of all, ZFS’ copy-on-write nature means that snapshots are “free.”
Creating a snapshot is instantaneous and consumes no additional space.
Copy-on-Write
In both ordinary filesystems and ZFS, files exist as blocks on the disk. In a
traditional filesystem, changing the file means changing the file’s blocks. If
the system crashes or loses power when the system is actively changing
those blocks, the resulting shorn write creates a file that’s half the old
version, half the new, and probably unusable.
ZFS never overwrites a file’s existing blocks. When something changes
a file, ZFS identifies the blocks that must change and writes them to a new
location on the disk. This is called copy-on-write, or COW. The old blocks
remain untouched. A shorn write might lose the newest changes to the file,
but the previous version of the file still exists intact.
Never losing a file is a great benefit of copy-on-write, but COW opens
up other possibilities. ZFS creates snapshots by keeping track of the old
versions of the changed blocks. That sounds deceptively simple, doesn’t it?
It is. But like everything simple, the details are complicated. We talked
about how ZFS stores data in Chapter 3, but let’s go deeper.
ZFS is almost an object-oriented filesystem. Metadata, indexing, and
data are all different types of objects that can point to other objects. A ZFS
pool is a giant tree of objects, rooted in the pool labels.
Each disk in a pool contains four copies of the ZFS label: two at the
front of the drive and two at the end. Each label contains the pool name, a
Globally Unique ID (GUID), and information on each member of the
VDEV. Each label also contains 128 KB for uberblocks.
The uberblock is a fixed size object that contains a pointer to the Meta
Object Set (MOS), the number of the transaction group that generated the
uberblock, and a checksum.
The MOS records the top-level information about everything in the
pool, including a pointer to a list of all of the root datasets in the pool. In
turn each of these lists points to similar lists for their children, and to
blocks that describe the files and directories stored in the dataset. ZFS
chains these lists and pointer objects as needed for your data. At the bottom
of the tree, the leaf blocks contain the actual data stored on the pool.
Every object contains a checksum and a birth time. The checksum is
used to make sure the object is valid. The birth time is the transaction group
(txg) number that created the block. Birth time is a critical part of snapshot
infrastructure.
Modifying a block of data touches the whole tree. The modified block of
data is written to a new location, so the block that points to it is updated.
This pointer block is also written to a new location, so the next object up
the tree needs updating. This percolates all the way up to the uberblock.
The uberblock is the root of the tree. Everything descends from it. ZFS
can’t modify the uberblock without breaking the rules of copy-on-write, so
it rotates the uberblock. Each label reserves 128 KB for uberblocks. Disks
with 512-byte sectors have 128 uberblocks, while disks with 4 KB sectors
have 32 uberblocks. If you have a disk with 16 KB sectors, it will have
only eight uberblocks. Each filesystem update adds a new uberblock to this
array. When the array fills up, the oldest uberblock gets overwritten.
When the system starts, ZFS scans all of the uberblocks, finds the
newest one with a valid checksum, and uses that to import the pool. Even if
the most recent update somehow got botched, the system can import a
consistent version of what the pool was like a few seconds before that. If
the system failed during a write, the very last data is lost—but that data
never made it to disk anyway. It’s gone, and ZFS can’t help you. Using
copy-on-write means that ZFS doesn’t suffer from the problems that make
fsck(8) necessary for traditional filesystems.
How Snapshots Work
When the administrator tells ZFS to create a snapshot, ZFS copies the
filesystem’s top-level metadata block. The live system uses the copy,
leaving the original for use by the snapshot. Creating the snapshot requires
copying only the one block, which means that ZFS can create snapshots
almost instantly. ZFS won’t modify data or metadata inside the snapshot,
making snapshots read-only. ZFS does record other metadata about the
snapshot, such as the birth time.
Snapshots also require a new piece of ZFS metadata, the dead list. A
dataset’s dead list records all the blocks that were used by the most recent
snapshot but are no longer part of the dataset. When you delete a file from
the dataset, the blocks used by that file get added to the dataset’s dead list.
When you create a snapshot, the live dataset’s dead list is handed off to the
snapshot and the live dataset gets a new, empty dead list.
Deleting, modifying, or overwriting a file on the live dataset means
allocating new blocks for the new data and disconnecting blocks containing
old data. Snapshots need some of those old data blocks, however. Before
discarding an old block, the system checks to see if a snapshot still needs it.
ZFS compares the birth time of the old data block with the birth time of
the most recent snapshot. Blocks younger than the snapshot can’t possibly
be used by that snapshot and can be tossed into the recycle bin. Blocks
older than the snapshot birth time are still used by the snapshot, and so get
added to the live dataset’s dead list.
After all this, a snapshot is merely a list of which blocks were in use in
the live dataset at the time the snapshot was taken. Creating a snapshot tells
ZFS to preserve those blocks, even if the files that use those blocks are
removed from the live filesystem.
This means that ZFS doesn’t keep copies of every version of every file.
When you create a new file and delete it before taking a snapshot, the file is
gone. Each snapshot contains a copy of each file as it existed when the
snapshot was created. ZFS does not retain a history like DragonFly’s
HAMMER.
Deleting a snapshot requires comparing block birth times to determine
which blocks can now be freed and which are still in use. If you delete the
most recent snapshot, the dataset’s current dead list gets updated to remove
blocks required only by that snapshot.
Snapshots mean that data can stick around for a long time. If you took a
snapshot one year ago, any blocks with a birth date more than a year ago
are still in use, whether you deleted them 11 months ago or before lunch
today. Deleting a six-month-old snapshot might not free up much space if
the year-old snapshot needs most of those blocks.1
Only once no filesystems, volumes, or snapshots use a block, does it get
freed.
Using Snapshots
To experiment with snapshots, let’s create a new filesystem dataset and
populate it with some files.
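The details don't matter much; a sketch, assuming a pool named mypool and a megabyte of random data plus a small date file, might look like this. The sizes in the listing are approximate.
# zfs create -o mountpoint=/mypool/sheep mypool/sheep
# dd if=/dev/random of=/mypool/sheep/randomfile bs=1m count=1
# date > /mypool/sheep/date
# zfs snapshot mypool/sheep@snap1
# zfs list -t snapshot -r mypool/sheep
NAME                 USED  AVAIL  REFER  MOUNTPOINT
mypool/sheep@snap1      0      -  1.11M  -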
Notice that the amount of space used by the snapshot (the USED
column) is 0. Every block in the snapshot is still used by the live dataset, so
the snapshot uses no additional space.
Dataset Changes and Snapshot Space
Now change the dataset and see how it affects the snapshots. Here we
append a megabyte of new crud to the random file and update the date file.
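Something like:
# dd if=/dev/random bs=1m count=1 >> /mypool/sheep/randomfile
# date > /mypool/sheep/date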
Think back on how snapshots work. The file of random data grew by
one megabyte, but that’s not in the old snapshot. The date file was replaced,
so the snapshot should have held onto the blocks used by the older file.
Let’s see what that does to the snapshot’s space usage.
# zfs list -t snapshot -r mypool/sheep
NAME USED AVAIL REFER MOUNTPOINT
mypool/sheep@snap1 72K - 1.11M -
The snapshot now uses 72 KB. The only space consumed by the
snapshot was for the replaced block from the date file. The snapshot
doesn’t get charged for the new space sucked up by the larger random file,
because no blocks were overwritten.
Now let’s create a second snapshot and see how much space it uses.
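Again, the exact sizes will vary a bit.
# zfs snapshot mypool/sheep@second
# zfs list -t snapshot -r mypool/sheep
NAME                  USED  AVAIL  REFER  MOUNTPOINT
mypool/sheep@snap1     72K      -  1.11M  -
mypool/sheep@second      0      -  2.11M  -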
The REFER column shows that the first snapshot gives you access to
1.11 MB of data, while the second lets you see 2.11 MB of data. The first
snapshot uses 72 KB of space, while the second uses none. The second
snapshot is still identical to the live dataset.
But not for long. Let’s change the live dataset by overwriting part of the
random file and see how space usage changes.
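Overwriting the first megabyte in place might look like this:
# dd if=/dev/random of=/mypool/sheep/randomfile bs=1m count=1 conv=notrunc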
We’ve overwritten one megabyte of the random data file. The first
snapshot’s space usage hasn’t changed. The second snapshot shows that it’s
using 1 MB of space to retain the overwritten data, plus some metadata
overhead.
Recursive Snapshots
ZFS lets you create recursive snapshots, which take a snapshot of the
dataset you specify and all its children. All of the snapshots have the same
name. Use the -r flag to recursively snapshot a system. Here we snapshot
the boot pool with a single command.
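Assuming the boot pool is mypool, that single command is:
# zfs snapshot -r mypool@beforeupgrade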
We now have a separate snapshot for each dataset in this pool, each
tagged with @beforeupgrade.
We can now abuse this system with wild abandon, secure in knowing
that a known good version of the system exists in snapshots.
Advanced Dataset and Snapshot Viewing
Once you grow accustomed to ZFS you’ll find that you’ve created a lot of
datasets, and that each dataset has a whole bunch of snapshots. Trying to
find the exact snapshots you want gets troublesome. While you can always
fall back on grep(1), the ZFS command line tools have very powerful
features for viewing and managing your datasets and snapshots. Combining
options lets you zero in on exactly what you want to see. We started with
zfs list in Chapter 4, but let’s plunge all the way in now.
Many of these options work for other types of datasets as well as
snapshots. If you stack filesystems 19 layers deep, you’ll probably want to
limit what you see. For most of us, though, snapshots are where these
options really start to be useful. Many features also work with zpool(8)
and pools, although pools don’t get as complicated as datasets.
A plain zfs list displays filesystem and zvol datasets, but no
snapshots.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 4.62G 13.7G 96K none
mypool/ROOT 469M 13.7G 96K none
mypool/ROOT/default 469M 13.7G 469M /
mypool/avolume 4.13G 17.8G 64K -
…
To view a pool or dataset and all of its children, add the -r flag and the
pool or dataset name.
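For example:
# zfs list -r mypool/ROOT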
Once you get many datasets, you’ll want to narrow this further.
View Datasets by Type
To see only a particular type of dataset, use the -t flag and the dataset type.
You can view filesystems, volumes, snapshots, and bookmarks.
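For example, to see only snapshots:
# zfs list -t snapshot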
Be sure you give the complete name, including the snapshot part. Here
we tell zfs list to show only snapshots, and then give it the name of a
filesystem dataset; zfs(8) very politely tells us to be consistent in what we
ask for.
We used the -r flag before to show a dataset and all of its children. It
also works with the list of snapshots.
If you have many layers of datasets you might want a partially recursive
view. While -r shows all the children, the -d option limits the number of
layers you see. Limit the depth to 1 and you get the snapshots of only a
single dataset.
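For example:
# zfs list -t snapshot -d 1 mypool/sheep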
You can display any property as a column as well. Here we list some
common filesystem properties for each dataset.
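The exact properties don't matter; pick whichever you need, as in:
# zfs list -o name,used,avail,compression,atime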
You can also set the pool's listsnapshots property to on, so that a plain zfs list includes snapshots. Once you've run with this for a while, however, we're highly confident
you'll turn it back off. Accumulated snapshots quickly overwhelm
everything else.
Scripts and ZFS
Sysadmins like automation. One annoying thing about automation is when
you must run a command and parse the output. Making output more
human-friendly often makes it less automation-friendly. The ZFS
developers were all too familiar with this problem, and included command-
line options to eliminate most of it.
The -p option tells zfs(8) and zpool(8) to print exact values, rather
than human-friendly ones. A pool doesn’t actually have 2 TB free—it’s just
a number that rounds to that. Using ‑p prints the actual value in all its
glory.
The -H option tells zfs(8) and zpool(8) to not print headers, and to
separate columns with a single tab, instead of making them line up nicely,
the way humans like. You are human, aren’t you?
Combined together, these options transform output from something
easily understood by humans to something you can feed straight to a script.
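For example, with illustrative values:
# zfs list -Hp -o name,used,avail mypool/sheep
mypool/sheep	10814976	14495514624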
Yes, that’s the real spacing. Orderly columns are for humans, silly.
Per-Snapshot Space Use
An especially useful property for snapshots is the written property, which
gives you an idea of how much new data that snapshot contains.
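For example:
# zfs list -t snapshot -o name,used,written -r mypool/sheep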
Snapshot contents are also accessible through a dataset's hidden .zfs directory.
# ls /mypool/sheep/.zfs
total 1
dr-xr-xr-x 2 root wheel 2 Mar 29 00:30 shares
dr-xr-xr-x 2 root wheel 2 Mar 30 16:27 snapshot
# ls -l /mypool/sheep/.zfs/snapshot
total 1
drwxr-xr-x 2 root wheel 5 Mar 29 00:40 snap1
drwxr-xr-x 2 root wheel 5 Mar 29 00:40 second
Go into that directory and you’ll find yourself inside the snapshot’s root
directory. Each file in the snapshot is preserved exactly as it was when the
snapshot was taken, down to the file access time. To recover individual
files from a snapshot, copy them back into the main filesystem.
Secret Snapdir
The .zfs snapdir is hidden by default. It won’t show up even if you run ls
-lA. This prevents backup programs, rsync, and similar software from
traversing into it. If you want the .zfs directory to show up, set the
dataset’s snapdir property to visible.
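For example:
# zfs set snapdir=visible mypool/sheep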
You cannot access a snapshot via the hidden .zfs directory while it is
manually mounted. Even a mounted snapshot is still read-only.
Deleting Snapshots
Snapshots prevent the blocks they use from being freed. This means you
don’t get that space back until you stop using those blocks, by removing all
snapshots that reference them.
Create a new snapshot, and then remove it with zfs destroy. Adding the -vn flags makes this a dry run: -n stops zfs destroy from actually doing anything, while -v shows what it would have done. You give the
dataset, and the target has to be a snapshot, so you need only the brief name of the
snapshot: second.
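A sketch of the dry run; the reclaimed size is illustrative.
# zfs snapshot mypool/sheep@second
# zfs destroy -vn mypool/sheep@second
would destroy mypool/sheep@second
would reclaim 0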
If you are sure, drop the -vn and truly destroy the snapshots:
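This one returns silently.
# zfs destroy mypool/sheep@second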
The snapshots are gone. Your users are now free to tell you that they
needed that data.
Rolling Back
Snapshots don’t only show you how the filesystem existed at a point in the
past. You can revert the entire dataset to its snapshot state. Going to do an
upgrade? Create a snapshot first. If it doesn’t work, just roll back. Use the
zfs rollback command to revert a filesystem to a snapshot. But once you roll back, anything written to the dataset after that snapshot is gone for good. Suppose we snapshot a dataset mounted at /delorean and then damage a file in it.
# cat /delorean/[Link]
“I broke the future”
Oh no, the future is broken. Let’s return to the present. Run zfs
rollback and give the name of the snapshot you want to use.
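The dataset and snapshot names here are only examples:
# zfs rollback mypool/delorean@present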
This takes less time than you might think. Remember, all the data and
metadata is already on disk. ZFS only switches which set of metadata it
uses. Once the rollback finishes, the live filesystem contains all the files
from the chosen snapshot.
# cat /delorean/[Link]
“this is the present”
The zfs rollback command can destroy all the intermediate snapshots
for you if you use the recursive (-r) flag. This is not the same kind of
multi-dataset recursion used in creating and destroying snapshots. Using
rollback -r does not roll back the children. You must roll back each
dataset separately.
You’ve gone back in time, and can now try your risky and painful
upgrade again. Congratulations!
Diffing Snapshots
Sometimes you really want to know what changed between the time the
snapshot was taken and now. If the database server started crashing at noon
today, you probably want to compare the state of the filesystem right now
with the 11 AM snapshot so you can see if anything changed. You could
use find(1) to look for files modified since the snapshot was created, or
you could use diff(1) to compare the files from the snapshot with the ones
from the live filesystem. ZFS already has this information, however, and
makes it available with zfs diff.
To look at the difference between a snapshot and the live filesystem, run
zfs diff and give it a snapshot name.
# zfs diff mypool/sheep@later
M /mypool/sheep/randomfile
Files can be in four states. A “-” means that the file was removed. A “+”
means that the file was added. An “M” indicates that the file has been
modified. And an “R” shows that the file has been renamed. Our example
here shows that the file /mypool/sheep/randomfile was modified after the
snapshot was taken.
You can also compare two snapshots by giving zfs diff both snapshot names.
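A sketch of what that might show; the removed and renamed file names here are illustrative.
# zfs diff mypool/sheep@snap1 mypool/sheep@later
- /mypool/sheep/date
R /mypool/sheep/stuff -> /mypool/sheep/otherstuff
M /mypool/sheep/randomfile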
In this example, one file was removed, another was renamed and, again, the file randomfile was
modified.
You can also get even more detail. If you add the -t flag, the output
includes the change’s timestamp from the inode. The -F flag includes the
type of the file. Check zfs(8) to get the full list of file types.
Automatic Snapshot Regimen
Snapshots are useful even if you create them only for special events. If you
create snapshots automatically on a schedule, however, they become
extremely useful. It’s simple enough to schedule creating a recursive
snapshot of your system every 15 minutes. If you keep all of these
snapshots, your pool fills up, however. Automated snapshots need rotating
and discarding just like backup tapes.
Rotation Schedule
The hard part of scheduling the creation and destruction of snapshots is
figuring out how you might use the snapshots. Who are your users? What
applications might need snapshots? We can’t answer those questions for
you.
One common setup is built around weekly, daily, hourly, and 15-minute
snapshots. You take weekly snapshots that you keep for two months. Daily
snapshots are retained for two weeks. Your hourly snapshots are retained
for three days. Then you take snapshots every 15 minutes and keep those
for six hours.
Maybe you need only four 15-minute snapshots. Or you must retain
monthly snapshots for a year. The regimen right for you depends on many
factors. How important is your data? How far back might you have to
reach? How space constrained are you? How often do your files change,
and how much new data gets written each day? Do you have IT audit
controls that dictate how long certain data must be retained? Talk with
other people on your team, and figure out a schedule that works for your
organization.
Once you have your desired schedule, ZFS tools can help you deploy it.
ZFS Tools
Many scripts and software packages can manage ZFS snapshots for you.
We recommend ZFS Tools ([Link]) as it
doesn’t use a configuration file. It does need cron(8), but you don’t have to
mess with any kind of [Link]. ZFS Tools takes its configuration
from user-defined properties set within ZFS. This means that new datasets
automatically inherit their snapshot configuration from their parent. When a
system has dozens of datasets and you’re constantly creating and removing
them, inherited configuration saves lots of time.
Install ZFS Tools from packages.
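At the time of this writing the package is named zfstools.
# pkg install zfstools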
ZFS Tools come with many scripts and applications, but right now
we’re concerned with zfs-auto-snapshot.
zfs-auto-snapshot
The zfs-auto-snapshot Ruby script creates and deletes snapshots. It takes
two arguments, the name of the snapshot, and the number of those
snapshots to keep. For example, running zfs-auto-snapshot frequent 4
creates a recursive snapshot named frequent, and keeps four snapshots of
each dataset.
Combined with cron(8), zfs-auto-snapshot lets you create whatever
snapshots you like, at any time interval desired, and then discard them as
they age out.
ZFS Tools come with a default crontab to create snapshots on a
schedule that the developers hope will fit most people’s needs. It starts by
setting $PATH so that zfs-auto-snapshot can find Ruby. It then has
entries to create 15-minute, hourly, daily, weekly, and monthly snapshots.
Let’s look at each.
zfs-auto-snapshot runs on the 15th, 30th, and 45th minute of each hour.
It creates a snapshot called frequent on each dataset. When a dataset has
more than four frequent snapshots, the oldest snapshots get removed until
only four remain.
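The crontab entry behind that looks something like this; the path is where the FreeBSD package puts the script.
15,30,45 * * * * root /usr/local/sbin/zfs-auto-snapshot frequent 4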
You can also disable just specific classes of snapshots. A dataset that
doesn’t change frequently probably doesn’t need frequent or hourly
snapshots. zfs-auto-snapshot checks for sub-properties of com.sun:auto-snapshot named after the snapshot period. For example, the sub-property for hourly snapshots is named hourly; set it to false on a dataset to skip that class of snapshots there.
Nothing ties a snapshot’s name to the schedule it actually runs on, either. Perhaps you once slipped at the command line and blew away lots of hourly snapshots. You can name your hourly snapshots monthly, and your yearly snapshots daily. If you’re short on people who detest you and all you stand for, this is a wonderful way to remedy that problem.
Holds
Sometimes, you want a specific snapshot to be retained despite any
automatic retention schedule or a desperate late-night effort to clean the
pool. Maybe there was an incident, or this is the starting point for some
replication. If you need to keep a snapshot, place a hold on it, like your
bank does when it doesn’t want you to spend your money.
Use zfs hold, a tag name, and the snapshot name. A tag name is a
human-readable label for this particular hold.
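For example, to keep a hypothetical snapshot around for an investigation, under the tag keepme:
# zfs hold keepme mypool/sheep@2015-04-01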
This locks the snapshot and assigns your tag name. One snapshot can
have many holds on it, so you can create holds for different purposes.
Holds can also be recursive. To lock all of the snapshots of the same
name on child datasets, using a common tag, use -r.
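Something along these lines, again with placeholder names:
# zfs hold -r keepme mypool/sheep@2015-04-01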
Release a hold with zfs release, giving the tag and the snapshot name.
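Continuing the placeholder example:
# zfs release keepme mypool/sheep@2015-04-01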
You can now destroy the snapshot. If only getting the bank to release
your funds was this easy!
Releasing a hold on a snapshot does not release any hold on its children,
however.
To recursively release all of the holds on a snapshot and its children, use
the -r flag.
# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
mypool              4.74G  13.5G    96K  none
...
mypool/sheep        10.3M  13.5G  6.10M  /mypool/sheep
mypool/dolly           8K  13.5G  2.11M  /mypool/dolly
mypool/second        192K  13.5G    96K  legacy
mypool/second/baby    96K  13.5G    96K  legacy
...
The dolly dataset looks like a normal dataset, except in its space usage.
The REFER column shows that it has 2 MB of data, but under USED it
takes up only 8 KB. The data it contains is from the original snapshot. A
clone consumes space for only newly written data, whether it be new files
or overwriting old ones.
Viewing Clones
Clones appear the same as regular datasets. In zfs list, you won’t notice
any difference between a clone and any other dataset.3 Clones record their
source snapshot in their origin property, however.
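For example, checking where the dolly dataset came from (the snapshot name here is hypothetical):
# zfs get origin mypool/dolly
NAME          PROPERTY  VALUE                  SOURCE
mypool/dolly  origin    mypool/sheep@pre-lamb  -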
Add the -R flag, and destroying the snapshot takes all the dependent
clones with it. You can delete the clone itself like any other filesystem
dataset.
Oh, wait. The clone inherited the zfs-auto-snapshot property from its
parent, so our snapshot automation caught it. If you didn’t want the clone
snapshotted, you should have turned that property off. You can manually
remove the clone’s snapshots, but zfs-auto-snapshot keeps creating new
ones. You can also use the -r (recursive) flag to destroy the clone and all
its snapshots.
ZFS knows that the dataset mypool/bonnie and its snapshot of origin depend on the mypool/wooly dataset. So we use the zfs promote command to make bonnie the filesystem, and turn the old dataset into the clone.
Before promoting the clone, run zfs list and check the space usage
and parentage of both datasets involved.
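A sketch of that check and the promotion itself:
# zfs list -o name,origin,used mypool/wooly mypool/bonnie
# zfs promote mypool/bonnie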
The promotion should run silently. Go take a look at these two datasets
again.
The snapshot that mypool/bonnie was based on, and all snapshots older
than that snapshot of origin, now belong to mypool/bonnie. Newer
snapshots of mypool/wooly, taken after the snapshot mypool/bonnie was
created from, still belong to mypool/wooly.
You can now destroy the old dataset and all of its snapshots.
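Assuming nothing else depends on mypool/wooly or its remaining snapshots, that’s a single command:
# zfs destroy -r mypool/wooly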
Remember that once a clone is forked from the main filesystem, it does
not get any updates from the parent. Any persistent data your application
needs should go in a different dataset. It can be a child dataset, as Jude
prefers. Lucas says that persistent data should go in a completely unrelated
dataset, just so a recursive remove doesn’t touch it. Protect your persistent data in whatever way you prefer.
Safely Managing Clones, Snapshots, and Recursion
You can take snapshots of datasets. You can create clones based on those
snapshots. You can then take snapshots of the clones and create more
clones. Despite your best efforts, you’re likely to produce a massive tangle
of interrelated clones and snapshots that exceed any human’s ability to
mentally track. ZFS gives you a whole bunch of power and convenience,
but clones make possible brand new types of mayhem that will churn your
bowels.4
The -nv flags are vital to safe systems administration. Any time the
merest thought of destroying a dataset begins to consider the possibility of
crossing your mind, do a verbose dry run with -nv. See what the destroy
command would actually eliminate. Read the list. You might find that your
recursive destroy pulls on a thread of clones that stretches all the way
across the pool.
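For example, a dry run against a hypothetical snapshot and everything that depends on it:
# zfs destroy -nvR mypool/sheep@2015-04-01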
Test before you leap. Always.
ZFS changes how you use disk space, but it’s still the sysadmin’s task to
manage it. Let’s cover that next.
1 Snapshots make you the data equivalent of a hoarder. Do try to not get buried in an
avalanche of old newspapers.
3 Clones look like their source material. That’s why they make such good assassins. No,
wait—wrong clones. Sorry.
4 Admittedly, the big shoes and red noses don’t help. No, wait—that’s clowns. Sorry,
never mind.
Chapter 8: Installing to ZFS
The whole point of learning about ZFS is to use the filesystem on a
machine. Let’s discuss FreeBSD 10 installs on ZFS.
If you must install a whole slew of FreeBSD machines, such as at a
server farm, we recommend the PC-BSD installer. Scripts like those we demonstrate here are fine for occasional installs of a few machines, but if
you’re configuring machines by the rack, you really need a PXE-based
installer.
Your hardware limits your choices. With rare exceptions, laptops have
one hard drive. Running FreeBSD on such a machine means using a single striped virtual device pool for storage. If you have hundreds of disks, you’ll need to
consider how you want to separate your pools.
When you have many, many disks, separate your operating system from
your data. FreeBSD and a good selection of add-on software fits nicely on a
mirrored pool or a RAID-Z. You don’t need RAID-Z3 for just the
operating system! If you have hundreds of data storage disks, use separate
pools for the operating system and data. With hundreds of disks, Lucas
would want a few separate pools, but he’s an old fogey. Jude would pour
them all into one giant pool. The only wrong choice is one that makes more
work for you.
This chapter assumes you’re familiar with GPT partitioning, FreeBSD
tools such as gpart(8), and FreeBSD partition types. If you’re not, check
the FreeBSD documentation or perhaps read FreeBSD Mastery: Storage
Essentials. (FMSE also covers installation scripts and other advanced
installation techniques.)
Installing a ZFS-based system requires configuring storage pools,
assigning datasets, and installing FreeBSD to the datasets. You can make
separate choices in each step, so we’ll consider each separately.
But start with a reference FreeBSD install.
FreeBSD Reference Install
Before installing your custom FreeBSD system, install a small FreeBSD
ZFS virtual machine as a reference platform. This offers lots of information
about a standard FreeBSD install. Installing your own system is great, but
don’t abandon all the carefully considered defaults the installer uses. Your
goal is probably to tweak the install in a way the installer doesn’t permit,
not abandon all FreeBSD standards.
Boot your reference platform, become root, and run zpool history to
see how this ZFS was created.
# zpool history
History for 'zroot':
2015-04-08.[Link] zpool create -o altroot=/mnt -O comp
2015-04-08.[Link] zfs create -o mountpoint=none zroot/
2015-04-08.[Link] zfs create -o mountpoint=/ zroot/ROO
2015-04-08.[Link] zfs create -o mountpoint=/tmp -o exe
...
The hardware BIOS boots the boot loader, which activates the pool and fires up
the FreeBSD kernel. Every disk in every virtual device in the boot pool
should have a ZFS boot loader installed, which means the disks must be
partitioned. The maximum size of the FreeBSD boot loader partition is just
a smidgen over 512 KB for some daft reason, so assign 512 KB for the
boot loader. Then we put in a 1 GB FreeBSD swap partition, and assign the
remaining space for ZFS. The swap and ZFS partitions are aligned at 1 MB
boundaries.
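Here’s one way to lay that out with gpart(8); the disk name da0 and the labels boot0, swap0, and zfs0 are placeholders:
# gpart create -s gpt da0
# gpart add -s 512k -t freebsd-boot -l boot0 da0
# gpart add -a 1m -s 1g -t freebsd-swap -l swap0 da0
# gpart add -a 1m -t freebsd-zfs -l zfs0 da0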
While I’m using these short names for the GPT labels for teaching
purposes, we strongly encourage you to use location-based labels as
discussed in Chapter 0.
Now install the FreeBSD ZFS bootloader onto this disk. Every disk you
might boot from needs the bootloader.
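On the example disk above, where the freebsd-boot partition is index 1, that looks something like:
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0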
Repeat this partitioning for every disk involved in the storage pool.
Pool Creation
Still at the disk formatting command prompt, assemble your disks into your
root storage pool. It’s common to call the root pool something like system
or zroot, or you might name it after the host. Do whatever fits for your
environment. I’m calling the example pool zroot, so that it fits with the
default used by the FreeBSD installer.
Look at the default FreeBSD install’s zpool history and consider what
it shows.
This is a FreeBSD 10.1 system. The installer mounts the boot pool
temporarily at /mnt, and we really have to keep that for the installer to
work. We want the other options, like setting compression to lz4 and
disabling atime. The -m none tells zpool(8) to not assign a mount point to
this pool. Using -f tells zpool(8) to ignore any other ZFS information on
these disks. The altroot property gives a temporary mount point, as
discussed in Chapter 4. You’re reinstalling pools, not recycling them.
The 10.1 installer wasn’t yet updated to take advantage of the
vfs.zfs.min_auto_ashift sysctl, but I’m going to use it now.
# sysctl vfs.zfs.min_auto_ashift=12
ZFS will now use 4096-byte sectors. Create the pool. We’re stealing all
of the default FreeBSD options, making only the changes we desire.
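Here’s a sketch using the options discussed above. The mirror of two example labels stands in for whatever virtual device layout your hardware calls for, and assumes a second disk partitioned the same way and labeled zfs1:
# zpool create -f -o altroot=/mnt -O compression=lz4 -O atime=off -m none zroot mirror gpt/zfs0 gpt/zfs1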
Chances are, FreeBSD’s default pool installation options are fine. You
probably want to tweak the datasets.
Datasets
While you want to create your custom dataset configuration, do check
FreeBSD’s installation defaults. They’re sensible for the average user, and
allow use of advanced features like boot environment managers.
If you want to complete your install with the FreeBSD installer, you
must give the installer a recognizable system. That means following steps
from the reference install, even if you’re not sure why a pool is exported
and imported at the end of the dataset creation step. In sum, we recommend
adding your own datasets, but leaving the defaults unchanged.
Here are a few bits from zpool history on a reference FreeBSD host,
omitting the timestamps.
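Reconstructed from a typical 10.1 reference install, and trimmed, those bits look something like this; your own host’s history is the authority here:
zfs create -o mountpoint=none zroot/ROOT
zfs create -o mountpoint=/ zroot/ROOT/default
zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zroot/tmp
...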
You can easily add or change your own datasets to this, creating zroot/
var/mysql or moving /home out from under /usr or whatever it is you
desire.
Creating datasets is a lot of typing. We recommend creating installation
scripts, as discussed in FreeBSD Mastery: Storage Essentials.
Once you have your datasets, exit the command-line partitioning and the
installer will resume.
Post-Install Setup
Once the installer finishes copying files to your disks, setting up
networking, and so on, you’ll get a chance to go into the installed system
and make changes. Take that chance. You must change a few settings in the
finished system.
Make sure that ZFS is enabled in /etc/rc.conf. This mounts your filesystem datasets at boot.
zfs_enable="YES"
The boot loader must also load the ZFS kernel module, so /boot/loader.conf needs:
zfs_load="YES"
You can also make any other system changes you like here.
While some documentation refers to other required steps, such as
copying the pool cache file, that’s no longer necessary.
Reboot when you’re finished, and you’ll come up in a new, customized
FreeBSD install!
Manually Installing FreeBSD
If you have to go to a command line to partition your disks, you might as
well install the FreeBSD files to the disk yourself. The FreeBSD
distribution files are in /usr/freebsd-dist, and you write them to your
disk with tar(1). Your installation target is mounted in /mnt.
You can install other distribution sets, but the base and kernel are the
only critical ones.
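For example, extracting the two critical sets onto the new pool:
# tar -xpf /usr/freebsd-dist/base.txz -C /mnt
# tar -xpf /usr/freebsd-dist/kernel.txz -C /mnt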
Your installation needs an /etc/fstab, for the swap files if nothing else.
Create it in /mnt/etc/fstab. You can also edit critical system files like /mnt/etc/rc.conf and /mnt/boot/loader.conf.
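A minimal fstab covering only the example swap partition (labeled swap0 earlier) might read:
# Device          Mountpoint  FStype  Options  Dump  Pass
/dev/gpt/swap0    none        swap    sw       0     0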
With a bit of work and testing, you can make your ZFS install as simple
or as complex as you like.
Exactly like ZFS itself.
Afterword
A whole book on a filesystem? Are you mad?
ZFS is merely a filesystem, yes. But it includes features that many
filesystems can’t even contemplate. You’d never try to wedge self-healing
into extfs, or variable block size into UFS2. Copy on write must be built
into the filesystem core—you can’t retrofit that into an existing filesystem.
By building on decades of experience with filesystems and targeting
modern and future hardware, ZFS has not only changed the way we
manage digital storage. It’s changed how we think about storage. ZFS’s
flexibility and power have even rendered many long-hallowed “system
administration best practices” obsolete. When your filesystem is an ally
rather than a menace, everything gets amazingly easier.
The continued growth of OpenZFS brings new features, steady
improvement, and a certain future that commercial vendors cannot provide.
OpenZFS gets more exciting every week, with new features and greater
performance. It also gets less exciting, in that it protects your data more and
more with each release. We are fortunate enough to be a part of this great
project, and are glad for this chance to share some of our excitement with
you.
While Lucas used ZFS before writing this book, Jude uses lots of ZFS
to deliver lots of content everywhere in the world. Jude is a FreeBSD doc
committer, but Lucas has written big stacks of books. Together, we’ve
created a stronger book than either of us could have alone.
And stay tuned for more ZFS from us. In FreeBSD Mastery: Advanced
ZFS, we’ll take you into strange new filesystem realms unimaginable just a
few years ago.
About the Authors
Allan Jude is VP of operations at ScaleEngine Inc., a global Video
Streaming CDN, where he makes extensive use of ZFS on FreeBSD. He is
also the host of the weekly video podcasts BSD Now (with Kris Moore)
and TechSNAP on [Link]. Allan is a FreeBSD
committer, focused on improving the documentation and implementing libucl
and libxo throughout the base system. He taught FreeBSD and NetBSD at
Mohawk College in Hamilton, Canada from 2007-2010 and has 13 years of
BSD unix sysadmin experience.
Michael W Lucas is a full time author. His FreeBSD experience is
almost as old as FreeBSD. He worked for twenty years as a sysadmin and
network engineer at a variety of firms, most of which no longer exist. He’s
written a whole stack of technology books, which have been translated into
nine languages. (Yes, real languages. Ones that people actually speak.) You
can find him lurking at various user groups around Detroit, Michigan, his
dojo ([Link]) or at [Link].
Find the authors on Twitter as @allanjude and @mwlauthor.
Never miss a new release!
[Link]
Blogs: [Link]
Twitter: @mwlauthor
More Tech Books from Michael W Lucas
Absolute BSD
Absolute OpenBSD (1st and 2nd edition)
Cisco Routers for the Desperate (1st and 2nd edition)
PGP and GPG
Absolute FreeBSD
Network Flow Analysis
SSH Mastery
DNSSEC Mastery
Sudo Mastery
FreeBSD Mastery: Storage Essentials
Networking for Systems Administrators
Tarsnap Mastery
FreeBSD Mastery: ZFS
FreeBSD Mastery: Advanced ZFS (coming 2015!)
FreeBSD Mastery: Specialty Filesystems (coming 2015!)
Table of Contents
Title Page
Publication Information
Acknowledgements
Dedication
Chapter 0: Introduction
What is ZFS?
ZFS History
Prerequisites
Where to Use ZFS?
ZFS Hardware
RAM
RAID Controllers
SATA vs. SAS vs. SSD
Disk Redundancy
Physical Redundancy
Disk Installation and Labeling
About this Book
Book Overview
Chapter 1: Introducing ZFS
ZFS Datasets
ZFS Partitions and Properties
ZFS Limits
Storage Pools
Virtual Devices
Blocks and Inodes
Chapter 2: Virtual Devices
Disks and Other Storage Media
Raw Disk Storage
Partition Storage
GEOM Device Storage
File-Backed Storage
Providers vs. Disks
VDEVs: Virtual Devices
VDEV Redundancy
Stripe (1 Provider)
Mirrors (2+ Providers)
RAID-Z1 (3+ Providers)
RAID-Z2 (4+ Providers)
RAID-Z3 (5+ Providers)
RAID-Z Disk Configurations
The RAID-Z Rule of 2s
Repairing VDEVs
RAID-Z versus traditional RAID
Special VDEVs
Separate Intent Log (SLOG, ZIL)
Cache (L2ARC)
How VDEVs Affect Performance
One Disk
Two Disks
Three Disks
Four or Five Disks
Six to Twelve Disks
Many Disks
Chapter 3: Pools
ZFS Blocks
Stripes, RAID, and Pools
Viewing Pools
Multiple VDEVs
Removing VDEVs
Pools Alignment and Disk Sector Size
Partition Alignment
ZFS Sector Size
FreeBSD 10.1 and Newer Ashift
Older FreeBSD Ashift
Creating Pools and VDEVs
Sample Drives
Striped Pools
Mirrored Pools
RAID-Z Pools
Multi-VDEV Pools
Using Log Devices
Mismatched VDEVs
Reusing Providers
Pool Integrity
ZFS Integrity
Scrubbing ZFS
Scrub Frequency
Pool Properties
Viewing Pool Properties
Changing Pool Properties
Pool History
Zpool Maintenance Automation
Removing Pools
Zpool Feature Flags
Viewing Feature Flags
Chapter 4: ZFS Datasets
Datasets
Dataset Types
Why Do I Want Datasets?
Viewing Datasets
Creating, Moving, and Destroying Datasets
Creating Filesystems
Creating Volumes
Renaming Datasets
Moving Datasets
Destroying Datasets
ZFS Properties
Viewing Properties
Changing Properties
Read-Only Properties
Filesystem Properties
atime
exec
readonly
setuid
User-Defined Properties
Parent/Child Relationships
Inheritance and Renaming
Removing Properties
Mounting ZFS Filesystems
Datasets without Mount Points
Multiple Datasets with the Same Mount Point
Pools without Mount Points
Manually Mounting and Unmounting Filesystems
ZFS and /etc/fstab
Tweaking ZFS Volumes
Space Reservations
Zvol Mode
Dataset Integrity
Checksums
Copies
Metadata Redundancy
Chapter 5: Repairs & Renovations
Resilvering
Expanding Pools
Adding VDEVs to Striped Pools
Adding VDEVs to Mirror Pools
Adding VDEVs to RAID-Z Pools
Hardware Status
Online
Degraded
Faulted
Unavail
Offline
Removed
Errors through the ZFS Stack
Restoring Devices
Missing Drives
Replacing Drives
Faulted Drives
Replacing the Same Slot
Replacing Unavail Drives
Replacing Mirror Providers
Reattaching Unavail and Removed Drives
Log and Cache Device Maintenance
Adding a Log or Cache Device
Removing Log and Cache Devices
Replacing Failed Log and Cache Devices
Exporting and Importing Drives
Exporting Pools
Importing Pools
Renaming Imported Pools
Incomplete Pools
Special Imports
Larger Providers
Zpool Versions and Upgrades
ZFS Versions and Feature Flags
Zpool Upgrades and the Boot Loader
FreeBSD ZFS Pool Limitations
Chapter 6: Disk Space Management
Reading ZFS Disk Usage
Referenced Data
Freeing Space
Disk Space Detail
Pool Space Usage
ZFS, df(1), and Other Traditional Tools
Limiting Dataset Size
Reservations
Viewing Reservations
Setting and Removing Reservations
Quotas
Dataset Quotas
Setting Quotas
Viewing Quotas
Exceeded Quotas
User and Group Quotas
Viewing Space Used and Existing Quotas per Dataset
Assigning and Removing User and Group Quotas
Viewing Individual Quotas
ZFS Compression
Enabling Compression
Compression Algorithms
Compression Properties
Choosing an Algorithm
When to Change Compression Algorithms
Compression and Performance
Deactivating Compression
Deduplication
Deduplication Memory Needs
Deduplication Effectiveness
Enabling Deduplication
Disabling Deduplication
Chapter 7: Snapshots and Clones
Copy-on-Write
How Snapshots Work
Using Snapshots
Creating a Snapshot
Dataset Changes and Snapshot Space
Recursive Snapshots
Advanced Dataset and Snapshot Viewing
View Datasets by Type
Modifying zfs list Output
Listing Snapshots by Default
Scripts and ZFS
Per-Snapshot Space Use
Accessing Snapshots
Secret Snapdir
Mounting Snapshots
Deleting Snapshots
Destruction Dry Runs
Recursion and Ranges
Rolling Back
Diffing Snapshots
Automatic Snapshot Regimen
Rotation Schedule
ZFS Tools
zfs-auto-snapshot
Enabling Automatic Snapshots
Viewing Automatic Snapshots
Getting Clever with zfs-auto-snap
Holds
Bookmarks
Clones
Creating a Clone
Viewing Clones
Deleting Clones and Snapshots
Promoting Clones
Safely Managing Clones, Snapshots, and Recursion
Chapter 8: Installing to ZFS
FreeBSD Reference Install
Custom ZFS Installation Partitioning
Disk Partitioning
Pool Creation
Datasets
Post-Install Setup
Manually Installing FreeBSD
Afterword
About the Authors
Never miss a new release!
More Tech Books from Michael W Lucas