ZFS Deduplication
Jeff Bonwick's Blog
http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup
Monday Nov 02, 2009

You knew this day was coming: ZFS now has built-in deduplication.
If you already know what dedup is and why you want it, you can skip the next couple of sections. For everyone else, let's start with a little background.
What is it?
Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77 or, in more familiar notation, 0.0000000000000000000000000000000000000000000000000000000000000000000000000001. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.
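To make the bookkeeping concrete, here is a toy sketch of such a table -- not ZFS code, just an illustration of the checksum-to-(location, refcount) idea, with an optional verify step like the one discussed further below:

    import hashlib

    class DedupTable:
        """Toy in-memory dedup table: checksum -> (storage location, reference count)."""

        def __init__(self, verify=False):
            self.entries = {}      # checksum -> [location, refcount]
            self.blocks = {}       # location -> block contents (stands in for the disk)
            self.verify = verify
            self.next_loc = 0

        def write_block(self, data):
            key = hashlib.sha256(data).digest()
            entry = self.entries.get(key)
            if entry is not None:
                loc, refcount = entry
                # Optional 'verify': compare contents before trusting the hash.
                if not self.verify or self.blocks[loc] == data:
                    entry[1] = refcount + 1   # duplicate: just bump the refcount
                    return loc
            # New (or, with verify, colliding) data: allocate space for it.
            loc = self.next_loc
            self.next_loc += 1
            self.blocks[loc] = data
            if entry is None:
                self.entries[key] = [loc, 1]
            return loc

    # Two identical writes share one location; the refcount goes from 1 to 2.
    ddt = DedupTable(verify=True)
    block = b"x" * 128 * 1024
    first = ddt.write_block(block)
    second = ddt.write_block(block)
    assert first == second
    assert ddt.entries[hashlib.sha256(block).digest()][1] == 2

The real dedup table in ZFS lives in the pool and is cached (memory, then L2ARC, then disk, as described below), but the reference-counting idea is the same.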
File-level dedup assigns a hash signature to an entire file. File-level dedup has the lowest overhead when the natural granularity of data duplication is whole files, but it also has significant limitations: any change to any block in the file requires recomputing the checksum of the whole file, which means that if even one block changes, any space savings is lost because the two versions of the file are no longer identical. This is fine when the expected workload is something like JPEG or MPEG files, but is completely ineffective when managing things like virtual machine images, which are mostly identical but differ in a few blocks.
Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated,
but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well.
Most of a VM image is duplicated data -- namely, a copy of the guest operating system -- but some blocks
are unique to each VM. With block-level dedup, only the blocks that are unique to each VM consume
additional storage space. All other blocks are shared.
Byte-level dedup is in principle the most general, but it is also the most costly because the dedup code
must compute 'anchor points' to determine where the regions of duplicated vs. unique data begin and
end. Nevertheless, this approach is ideal for certain mail servers, in which an attachment may appear
many times but not necessarily be block-aligned in each user's inbox. This type of deduplication is
generally best left to the application (e.g. Exchange server), because the application understands the data
it's managing and can easily eliminate duplicates internally rather than relying on the storage system to
find them after the fact.
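For the curious, the 'anchor points' mentioned above are usually found with content-defined chunking: slide a rolling hash over the byte stream and cut a chunk wherever the hash matches a chosen pattern, so identical content chunks the same way no matter what byte offset it lands at. A toy sketch, with arbitrary parameters that are not taken from any real implementation:

    def chunk_boundaries(data, window=48, mask=(1 << 12) - 1, base=257, mod=1 << 32):
        """Yield (start, end) chunks of `data`. A rolling hash over the last
        `window` bytes picks the anchor points: a chunk ends wherever the hash
        matches the mask pattern, so the average chunk is roughly mask+1 bytes."""
        if len(data) < window:
            if data:
                yield (0, len(data))
            return
        pow_w = pow(base, window - 1, mod)   # weight of the byte leaving the window
        h = 0
        for i in range(window):              # hash of the first full window
            h = (h * base + data[i]) % mod
        start = 0
        for i in range(window, len(data) + 1):
            if (h & mask) == mask and i - start >= window:
                yield (start, i)             # anchor point: cut the chunk here
                start = i
            if i == len(data):
                break
            # Slide the window one byte: drop data[i - window], add data[i].
            h = (h - data[i - window] * pow_w) % mod
            h = (h * base + data[i]) % mod
        if start < len(data):
            yield (start, len(data))

    # Identical content chunked at two different offsets yields mostly the same chunks.
    import random
    random.seed(0)
    payload = bytes(random.getrandbits(8) for _ in range(200_000))
    original = {payload[s:e] for s, e in chunk_boundaries(payload)}
    shifted_stream = b"some unrelated header bytes " + payload
    shifted = {shifted_stream[s:e] for s, e in chunk_boundaries(shifted_stream)}
    print(len(original & shifted), "of", len(original), "chunks shared")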
ZFS provides block-level deduplication because this is the finest granularity that makes sense for a
general-purpose storage system. Block-level dedup also maps naturally to ZFS's 256-bit block checksums,
which provide unique block signatures for all blocks in a storage pool as long as the checksum function is
cryptographically strong (e.g. SHA256).
ZFS deduplication is synchronous. ZFS assumes a highly multithreaded operating system (Solaris) and a
hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster
than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests
that it will continue.
If you have a storage pool named 'tank' and you want to use dedup, just type this:

    zfs set dedup=on tank

That's it.
Like all zfs properties, the 'dedup' property follows the usual rules for ZFS dataset property inheritance.
Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis.
If your data doesn't contain any duplicates, enabling dedup will add overhead (a more CPU-intensive
checksum and on-disk dedup table entries) without providing any benefit. If your data does contain
duplicates, enabling dedup will both save space and increase performance. The space savings are obvious;
the performance improvement is due to the elimination of disk writes when storing duplicate data, plus
the reduced memory footprint due to many applications sharing the same pages of memory.
Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated.
ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to
help. For example, suppose you have a storage pool containing home directories, virtual machine images,
and source code repositories. You might choose to enable dedup as follows (the dataset names here are illustrative):

    zfs set dedup=on tank/vm
    zfs set dedup=on tank/src
Trust or verify?
If you accept the mathematical claim that a secure hash like SHA256 has only a 2^-256 probability of
producing the same output given two different inputs, then it is reasonable to assume that when two
blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous
amount of the world's commerce operates on this assumption, including your daily credit card
transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':

    zfs set dedup=verify tank
Selecting a checksum
Given the ability to detect hash collisions as described above, it is possible to use much weaker (but
faster) hash functions in combination with the 'verify' option to provide faster dedup. ZFS offers this
option for the fletcher4 checksum, which is quite fast:

    zfs set dedup=fletcher4,verify tank
The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore
cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the 'verify'
option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely
duplicate data, this may provide better overall performance than a secure hash without collision
verification.
Unfortunately, because there are so many variables that affect performance, I cannot offer any absolute
guidance on which is better. However, if you are willing to make the investment to experiment with
different checksum/verify options on your data, the payoff may be substantial. Otherwise, just stick with
the default provided by setting dedup=on; it's cryptographically strong and it's still pretty fast.
Most dedup solutions only work on a limited amount of data -- a handful of terabytes -- because they
require their dedup tables to be resident in memory.
ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The
performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables)
fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be
read from disk. The topic of dedup performance could easily fill many blog entries -- and it will over time --
but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any
capacity on any platform, even a laptop; it just goes faster as you give it more hardware.
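For a back-of-the-envelope feel for when the DDT will fit in memory, you can estimate one table entry per unique block; the per-entry byte cost below is an assumed placeholder for illustration, not a number from the ZFS code:

    def ddt_size_estimate(logical_bytes, avg_block_size=128 * 1024,
                          dedup_ratio=1.0, bytes_per_entry=300):
        """One DDT entry per *unique* block; `bytes_per_entry` is an assumed
        in-core cost for illustration, not a figure from the ZFS source."""
        unique_blocks = logical_bytes / avg_block_size / dedup_ratio
        return unique_blocks * bytes_per_entry

    # 10 TB of data in 128K blocks, with the assumed 300 bytes per entry:
    print(ddt_size_estimate(10 * 2**40) / 2**30)                   # ~23 GiB, all blocks unique
    print(ddt_size_estimate(10 * 2**40, dedup_ratio=2) / 2**30)    # ~12 GiB at 2x duplication

Plug in your own numbers; the point is just that DDT size scales with the number of unique blocks, not with total data.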
Acknowledgements
Bill Moore and I developed the first dedup prototype in two very intense days in December 2008. Mark
Maybee and Matt Ahrens helped us navigate the interactions of this mostly-SPA code change with the
ARC and DMU. Our initial prototype was quite primitive: it didn't support gang blocks, ditto blocks, out-of-
space, and various other real-world conditions. However, it confirmed that the basic approach we'd been
planning for several years was sound: namely, to use the 256-bit block checksums in ZFS as hash
signatures for dedup.
Over the next several months Bill and I tag-teamed the work so that at least one of us could make
forward progress while the other dealt with some random interrupt of the day.
As we approached the end game, Matt Ahrens and Adam Leventhal developed several optimizations for
the ZAP to minimize DDT space consumption both on disk and in memory, key factors in dedup
performance. George Wilson stepped in to help with, well, just about everything, as he always does.
For final code review George and I flew to Colorado where many folks generously lent their time and
expertise: Mark Maybee, Neil Perrin, Lori Alt, Eric Taylor, and Tim Haley.
Our test team, led by Robin Guo, pounded on the code and made a couple of great finds -- which were
actually latent bugs exposed by some new, tighter ASSERTs in the dedup code.
My family (Cathy, Andrew, David, and Galen) demonstrated enormous patience as the project became
all-consuming for the last few months. On more than one occasion one of the kids has asked whether we
can do something and then immediately followed their own question with, "Let me guess: after dedup is
done."
Well, kids, dedup is done. We're going to have some fun now.
Comments:
Really great news. While other filesystems try to get to the point where ZFS was yesterday, ZFS moves ahead.
"When using a secure hash like SHA256, the probability of a hash collision is about 2^-256"
Will dedup speed up copying or moving files from one dataset to another? If yes, will it result in read-only
activity on the disks when moving the file or, even better, in only increasing the reference count of the blocks?
Well, now you just need to dedup your time schedule, and you'll have a lot of spare time blocks laying
around that you can use for your kids!
Birthday collisions, having only 365 values, fully utilize less than 15 bits. 256 bit offers substantially more
values. Feel free to recalculate the birthday formulas with 2^256 instead of 365 and post the result if you
like.
Does Dedup also apply to data in the L2ARC? If so... wow o_0
You mentioned that probably everyone wants to have the DDTs in memory. So a little formula like
zfs_size/blocksize * N ~ DDT size would be helpful as well...
Bahamat: did you read the link? It includes probabilities for 256 bit hashes. With a 256 bit hash, there's a
1-in-a-billion chance of a collision with 1.5 × 10^34 blocks. How many blocks might there be in a ZFS
array?
Excellent news; for starters this should give us the space benefits of a sparse zone with the flexibility of a full
root one.
@max, @bastien -
There is a discussion of the overall chance of hash collision when factoring in the total number of blocks in
the ARC thread for a related, but orthogonal case:
http://arc.opensolaris.org/caselog/PSARC/2009/557/mail
http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table
To have a collision probability of 10^-18 (already more reliable than almost anything else in the system), this
would require approximately 2^98 unique blocks (2^115 bytes @128k) to be written, well beyond the limits
for any foreseeable storage platform.
The problem with using a hash function is that attackers control a lot of data on your filesystem on a modern
OS. Consider the web cache on a machine where a user browses the web. This allows an attacker a platform
to intentionally try to cause collisions that can cause the filesystem to malfunction. These sorts of attacks are still hard, but there has been a lot of progress in attacking several popular hash functions lately. A solution to
prevent this is to use a keyed hash function and keep the key secret.
Do you have any advice or observations on how dedup interacts with ZFS compression?
If the implementation of the SHA256 (or possibly SHA512 at some point) algorithm is well threaded then
one would be able to leverage those massively multi-core Niagara T2 servers. The SHA256 hash is based on
six 32-bit functions whereas SHA512 is based on six 64-bit functions. The CMT Niagara T2 can easily
process those 64-bit hash functions and the multi-core CMT trend is well established. So long as context
switch times are very low one would think that IO with a SHA512 based de-dupe implementation would be
possible and even realistic. That would solve the hash collision concern I would think.
How long until we can actually use this feature in the development builds? Canʼt wait to try it out!
"Excellent news; for starters this should give us the space benefits of a sparse zone with the flexibility of a full
root one."
Unless the deduplication also spills over into the memory management, it doesn't: Running two deduplicated
full root zones still requires twice as much RAM as running only one of them, while running two sparse zones only requires twice the read/write pages, with the read-only pages being shared.
So the questions are: How often does Solaris load a couple of identical pages to memory, when they're
deduplicated on disk? Are there plans to get that to "once", if it isn't already?
Real-life scenario: We run 60 sparse zones vs. the 16 full zones that the system could manage before we're
out of RAM (4GB installed).
And the difference only grows with more RAM.
Does the deduping apply to the ARC/L2ARC as well (ie, only pointers to duplicate blocks reside, rather than
the whole block)?
Any ideas when this will show up in releases of OpenSolaris or Solaris 10?
One question I thought of while reading is do different pools recognize the data in the other pool for
deduplication. For example, if I have pools A and B (deduplication is enabled in each) and I have an identical
block X in both pools, will it be deduplicated? There are pros and cons to deduplicating across pools and
not deduplicating across pools. Which did you choose and will there be options in the future to do the
opposite?
@Don - The ARC work (so that we deduplicate in-core) is forthcoming. Mark Maybee is making good
progress on that.
And yes, it works perfectly well with compression - just as you would imagine. :)
Congrats Jeff,
I'm looking forward to snv_128 on IPS. Is there any way you can hurry the binary images over to IPS?
My Apologies,
newsham: as far as the world's open cryptographic community knows, it is impossible for anyone to generate
collisions in SHA-256 even if they are deliberately trying to do so.
Dennis: Hm, can't ZFS use the hardware implementation of SHA-256 in the T2? Anyway, "the hash collision
concern" is already solved by SHA-256 -- see Eric Shrock's comment.
all: it seems like there are some funny interactions between dedupe and crypto: http://mail.opensolaris.org/pipermail/zfs-crypto-discuss/2009-November/002947.html. I'm glad to learn from this blog post that
Nicolas Williams was wrong to say that dedupe will always require full block comparison.
So, is this going to be included in the next update of solaris or are we going to have to download some kind
of a kernel patch soon?
Thanks.
This is something I always wished. Some people rant about differential copies (copy on write of files) or
sparse files, but the problem of these schemes is that they tend to not survive the file being copied on a
different disk, or at any rate they work only if the filesystem has knowledge of the original creation of the
redundancy. That is not the case here, and so this is much more general. Imagine, one can now write a file of one
petabyte of zeroes - without the application telling the filesystem (other than by writing said zeroes).
Fun question: would it be possible for a malicious user to try and write blocks that he suspects are the same
as blocks from data he's not supposed to be able to read, and figure out if a deduplication occurred by
timing the process? Would be an interesting attack. (Note: this has nothing to do with finding a collision with whatever hash function is used, though these attacks are interesting as well; as far as I am aware, so far ZFS was only dependent on the hash function's resistance to pre-image attacks, if of course ZFS was supposed to guarantee
cryptographic integrity; but now the hash function must also resist collision attacks or fun things might
occur...)
One drawback of the current support, however: on a huge array where the dedup table has to spill to disk, and which is used in bursts for, say, backup, browsing the table (and adding entries to it) has no locality whatsoever (otherwise there would be a problem with the hash function, if I'm not mistaken), so it will have to hit the disk in proportion to the fraction of the table that is not in memory (in other words: you cannot efficiently cache that table in memory). An offline option would have been useful for that use case.
I'm curious about the claim of increased performance across the board -- doesn't read performance suffer
from the transformation of what was once a contiguous read into many short reads in different areas of
the disk? Or perhaps you have a strategy to deal with that, or don't find this fragmentation to be a problem
in practice?
Thanks.
Well done Jeff. You seem to be involved with all the best work ;)
Re:"One question I thought of while reading is do different pools recognize the data in the other pool for
deduplication. "
I doubt that dedup has the ability to be done across pools. Reason a is that it's a ZFS flag to turn it on rather
than a zpool one. Reason b is that you can't be sure that the second pool will not be removed or even
exported. I prefer it this way.
Birthday paradox arguments aside, cryptographic hashes have a long history of being broken. If this ever
happens to SHA-256, the Mother of All Remote privilege escalations will immediately apply. E.g., a collision
with a block of the Windows kernel would let *a web site* run privileged code on however-many-hundreds of
VMs hang off the corrupted physical block (assuming web caches get written to virtual disk).
The verify flag provides an out here, but it is not the default, and most users will take that hint. Unfortunately,
there is no way to know whether you should have used verify until it is too late; if SHA-256 is ever broken,
then a non-verify pool is defective.
I realize you all are smart cookies, and have thought much more deeply about this than I have just reading
this blog post. So what's the punchline? Are we that much more confident about SHA-256 than its
predecessors? Is the performance hit to verify enough to make the system useless for important domains?
Some combination of the two?
@Keith Adams:
It's not _that_ bad. For your exploit to work, you'd need to have the faulty file around _first_.
Deduplication doesn't overwrite existing files with a duplicate, but avoids writing duplicates in the first place.
(at least as far as I understand it)
So, the attack vector would require you to know beforehand of a new component that's normally run with elevated rights. Then push a file with the same hash to the system, and then wait for deduplication to kick
in.
That's a whole lot harder than "hey, I can write a file with the same content as the kernel".
It's possible to scan the hash space: Write all kinds of files, hash with a different hash in RAM and on disk, if
they differ, you got a collision with another file.
And then hope for it to be something truly secret ;-)
But given the size of the hash space, that's not a productive use of your time, and it would be too easy to
figure out that there's something fishy going on (who's creating and deleting billions of files all the time?)
RT
www.complete-privacy.at.tc
@zooko: indeed, ZFS dedup does have the option of not verifying block contents when hashes match -- I
spoke too quickly. A minor error, I think.
@{Zooko, various others}: The point is that if you don't want to trust the hash function, well, you don't have to.
@Zooko: Back to the ZFS crypto issue that Darren was asking for advice on: by MACing every block in
addition to hashing it we don't depend on the hash function's collision resistance for security, though, clearly,
for dedup you'll want to enable block contents verification if you do fear attackers that can create hash
collisions. IMO, not depending on a hash function's collision resistance is a good thing.
A quick & simple suggestion wrt "Trust or verify?" and, specifically, using the fast fletcher hash with subsequent verification of positives by byte-comparison. There is a third option: always look up by fletcher hash (assuming blocks are indexed by both this AND sha-256), and use sha to verify positives.
This has a slight advantage over verification by byte-comparison, especially with multiple positives, and a huge win -- fletcher over sha -- in negative cases. The latter yields a nice property: the less duplicated the data, the lower dedup's overhead.
That's nice and all... but for my purposes, it would be far more of a "win" to allow for cross-zfs (but same pool) hardlinks. And/or mv's.
Not to mention some kind of zfs-aware rsync, for fast, efficient remote replication (or restorals, for that matter!) that does not require keeping a matching "full-filesystem snapshot" around.
Maybe with this dup-detection stuff, you will be closer to having that happen now?
What a lot of others asked: Where and when are we, the general public, going to be able to see it and test it
ourselves? Your blog speaks as if it's readily available, yet I don't see it in Update 8 of Sol10, nor in recent
snapshots of OpenSolaris. Am I missing something?
Thanks
I stand corrected:
http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html
The probability of a collision for an equiprobable hash function is ~ 2^(-n/2), where "n" is the size of the output hash in bits. The probability is _not_ 2^-n (that's the probability of getting a single output, not the probability of two input documents producing the same output, which is exactly a collision).
For more information about collisions for cryptographic functions please read about the Birthday attack. For example: http://en.wikipedia.org/wiki/Birthday_attack.
Did you consider side-channel attacks? Let us assume that an attacker that is a user on a machine knows
that somewhere on disk there is a block that contains just a username and a password, say "john:abc"
Now he could write his own combinations "john:aaa", "john:aab", .. ,"john:zzz" each to a unique block in a file
on disk and notice that the timing of writing "john:abc" is different to all the others, because the block
"john:abc" is deduped?
@Wrex: You should see it in OpenSolaris dev builds (i.e. http://pkg.opensolaris.org/dev) in roughly a month.
The current "build" (build 128) closes for code changes on 9 November, then it gets some QA, and then is
published. We just pushed build 126 publicly last week.
More thoughts: How about using dedup processing, to then enable synthetic snapshot creation across
systems?
Example: two separate solaris systems, both running zfs filesystems that have "mostly similar" content, and
need to be kept near-synced in future.
Rather than having to completely blow away and rebuild one, to then have a shared full snapshot for
incremental zsends... how about some kind of tool that would create one?
Reasons this would be worthwhile, could be large datasets, and bandwidth-constrained interconnects.
and/or: user error. They were previously kept in sync with zsync, but some admin accidentally deletes the
"wrong" snapshot. or all snapshots. That admin is going to have some very very unhappy users, unless
there's a nice neat way to regen the common snapshot without long downtimes for rebuild.
For all the people that will climb down the hole of hash function probability it would be interesting to contrast
that with just bit rot on modern drives.
@Klaus:
Timing it may be unreliable, especially on a busy system, but someone could just dtrace it to figure it out. It might make sense to have a nodedup option applied at file level so developers have a chance to address such concerns for sensitive files.
We have no idea on the suitability of our data for dedup. Is there any method available that can scan the
data and report on the suitability of switching dedup on?
Super news!! :D
@Felipe Alfaro Solana: The "probability of a collision" is NOT about 2^(-n/2). Rather, 2^(n/2) is the approximate number of items one needs to hash with an n-bit hash function for the probability of a collision to be about 50%.
The critical issue, which a few people have touched upon, is that the probability of "a collision" depends on how much data is being hashed. If we have a zettabyte of data and a 128K block size we can end up needing to hash up to 2^53 (about 10^16) blocks. The probability of getting any collisions on a 256-bit hash function with this many tries is about 1 in 2^151 (less than 1 in 10^46). This is dozens of orders of magnitude safer than the underlying disc drives.
Re side channel attacks - seems to me they are plausible both by measuring system response time and by
measuring available space on the volume(s) after writing the suspect block(s). Even on a busy system, it will be
possible to identify trends. (e.g., if we write block "X" 1000 times, it is always much faster than if we write
block "Y" 1000 times; same for space usage.) These attacks could be mitigated by making the actual
response time equal to the expected worst-case scenario, and by limiting "space available" responses to
"available under quota" versus "absolute space available", but both approaches would create other side
effects such as slower response time(s) and forcing the use of quotas.
It could also be used to identify media files (music, video, whatever) if there's a typical/canonical encoding
that's likely to be used.
Might be useful for identifying executable/library/data files installed on a machine (perhaps at the version
level to identify vulnerabilities) if the attacker can work with a known example; or for iteratively determining
the contents of a sensitive file such as a *nix password file (discussed above by Klaus Borchers) or a file
containing a database access password, passphrase, .htaccess file (or password file pointed to by .htaccess
file), etc.
I am a fan of the deduplication idea, but I think it has significant security implications that may be tough to
identify early.
Jeff,
Congratulations to you and your team! You have reenergized my enthusiasm for Solaris and you have given my business a strong reason to go w/ Sun Solaris in lieu of the competition.
Joe
I have a large ZFS pool (on a Storage 7000 system) containing virtual machine images. I'm sure the dedup
stuff will be very useful for this kind of data. I'm wondering, though, will there be some way of forcing existing
data through the algorithm, or will only data that was written after dedup was turned on benefit?
1. What happens when the DDTs cannot fit in memory? How many extra I/Os will be needed in this case for
each application I/O?
2. Given the common case that two blocks are very similar but not identical, how does ZFS handle this?
So what happens when you lose a block on the disk that happened to be the reference block for 12 others?
First thing I'd want to know before letting this option anywhere near my production filesystems is how much
redundancy is there and how easy is it to configure?
Does this open a backup poisoning attack? If I write the same blocks in sequence in one big or several small files, with a sequence just long enough to fool the compression system of the backup, would this enable an attacker to exhaust backup space?
If the backup space is another deduped ZFS system, would this enable an attacker to exhaust the communications capability? If there is a quota on the accounts, this is not a real problem, but with unlimited accounts, it may be.
Of course, most of the time it is not an attacker, but an idiot running untested software, without having a look at it, for extended periods of time.
Can ZFS trigger a warning if a block happens to be referenced more than a settable number of times?
Sophed,
There is a mechanism for that. If a block is referenced by 1000 other blocks you have a severe problem if that block corrupts. Therefore you will be able to specify how many references are allowed to a block -- saying something like, "a block can not have more than 10 references" or something similar.
But ZFS is very secure, and for redundancy you use raidz2 or raidz3, of course. With raidz2 (raid-6) you will get lots of redundancy.
Jeff,
Good work! BTW, jeff, did you know that Solaris is the best OS out there? :o) Truly.
Dreaming about a BitTorrent client that uses dedup to find chunks on disk before trying to download them.
Does the p value for collisions hold true for blocks that only differ in a single bit?
and
Do you/have you empirically tested your implementation in some way to verify collision frequency...?
--
David Strom
Awesome work! Congratulations! Can't wait to try this out. By the way, does this mean that data deduplication software is going to be pushed aside?
"You should see it in OpenSolaris dev builds (i.e. http://pkg.opensolaris.org/dev) in roughly a month. The
current "build" (build 128) closes for code changes on 9 November, then it gets some QA, and then is
published. We just pushed build 126 publicly last week."
That's great, but OpenSolaris is so GNU/Linux bastardized and changing so fast that it's both illogical and impractical to use it in production.
So I have to ask... Why use an expensive hashing algorithm at all? Why not use a cheaper hash, but use trust
+ verify before actually performing a deduplication? This will reduce processor load and eliminate the issue of collisions in the event two pieces of data hash the same but are in fact different.
On my system, MD5 costs 1/3 of what SHA256 costs in CPU time, and while it may be more likely to cause
collisions, ZFS should always do a byte-to-byte comparison on any block before deduping one to make sure
they are, in fact, identical.
For some real world hash collision examples you may want to try using backuppc on your data. It includes
deduplication, but runs at a file level.
I was rather surprised to find that in a small system (a few TB), I'm running into quite a few files that have the
same hash but different contents.
# Pool is 2698.23GB comprising 1232264 files and 4369 directories (as of 11/3 10:16),
# Pool hashing gives 2651 repeated files with longest chain 51,
So, I have 3TB of data, and one of the checksums has 51 collisions that were detected (51 different files that
had the same hash but different contents).
At first I was surprised that there were any collisions, but then I remembered the birthday problem...
In any case, it seems like "verify" is the most conservative setting, and I'm surprised that the file-system that
basically promises to never corrupt your data defaults to a setting in which this could happen.
Sean
By default, fletcher4 is used when creating new zfs file-systems. Many users like myself have already filled
up a bunch of zfs file-systems using fletcher4. What happens if a user tries to enable dedup but doesn't use
verify for an existing fletcher4 filesystem? Is there a warning/error?
For the paranoid and performance conscious, I could see wanting to do dedup with verification in a batch
job, run at least nightly.
@Sean Reifschneider
What hash function is being used by software you mentioned?
@Greg: Come to think of it, the easiest exploit would probably be to use the "readahead" features of the
system, where not single blocks (say 4K) are read from disk, but larger sectors (say 64K).
Reading sixteen 4k-blocks on an even sector boundary, the first block, if not in cache, will take at least
several msecs (disk access time), while the 15 subsequent blocks will be available in usecs.
In order to find out if a block with a certain content exists anywhere on disk, from within, say, a VM, you just
write a sector with 15 blocks of random garbage onto the disk, but one block somewhere in the middle, e.g.
#9, contains the contents you want to check. After a few minutes to hours, when the data you have written
has been evicted from cache in ram, you read back the first block, which takes a long time, and then the
others, which follow almost immediately - except for #9, which was deduplicated, and must be fetched from
somewhere else on disk.
In practice, I guess, with all the optimisations and layers in the system, it may be far more complicated, but
securing the system against this kind of attack will be even more difficult.
If this is turned on for an existing ZFS file system, are pre-existing segments able to be deduplicated? If ZFS only supports synchronous dedupe, does the data have to be pulled off and repopulated?
Hi
As we rely heavily on ZFS for more than 2 billion files in different filesystems, my question is: does dedup work across filesystems?
And is there a way to get the non-dedup ratio compared to the dedup ratio on a zfs filesystem?
Could we get the "not deduped" size of a filesystem with df, or do we receive the deduped size?
Michael
First of all, thank you all for this feature, and I hope you and your families will get some quality time together
And I also thank the commenters for raising some interesting questions and ideas.
Concerning the matter of deduplicating data that's already on our disks, in our existing systems, I'm afraid we'd have to make do for a while with a trick I use to compress previously uncompressed files (say, local zones). We shut down the zone, move its files to a subdirectory, and use Sun cpio (keeping ZFS ACLs) to copy the files back to their expected location. Upon write, they are compressed (and nowadays they'd also be deduped). This can be tweaked to do per-file copies/removals to minimize the free-space pressure when remaking existing systems.
Needless to say, some supported utility which allows one to (un-)compress and (un-)dedup existing files in place (like setting the Compressed flag on NTFS objects) is very welcome and long-awaited :) The already de-facto working capability of using different compression algorithms within the same ZFS dataset is also a bonus versus NTFS, and I'd love to see that in said utility. (In the example of our local zones, the fresh install of binary files can be done with gzip-9, then new files like data and logs are written with a faster lzjb.)
Another question arises: what if we have same files (blocks) compressed with different algorithms? On-disk
blocks apparently contain different (compressed) data and have different checksums for the same original
data, and different amount of on-disk blocks for the same original files.
These would probably not dedup ultimately to one block, but to at least as many as there are different
compression algorithms?
And for compressed blocks inside different original files (including VM images) the block-alignment would
make it even less probable that we have dedupable whole blocks? Even a one-byte offset would not let us
save space from otherwise same original data?
In short - does it mean we would (probably) save more space by not compressing certain filesystems (i.e. VM
image containers) but rather only deduping them?
> Well, now you just need to dedup your time schedule, and you'll have a lot of spare time blocks laying
around that you can use for your kids!
Apparently, this strikes the family time too. Instead of going with kids to a zoo 10 times, "Jeff" only goes once
and tells the family that they should remember it as 10 different trips ;)
I know about checksums and how they work for RAID-Z. But if ZFS can use sha256 for duplicate searching, maybe it can use it for self-healing instead of deduplicating?
@QuAzI
Sha256 in ZFS IS used for self healing if you set it as the checksum algorithm. It is used, instead of fletcher, to check the integrity of every block in datasets, not only in RAIDZ. It has been like that since the creation of ZFS, and now this same checksum field can be used for two purposes, integrity and deduplication.
Thanks. I didn't find that in the overviews and documents, just examples of self-healing of mirrors. That's good.
Out of curiosity, how does dedup interact with userquota/groupquota? Will the full size of the deduped block
count against the quota? I'm guessing that's the case as it's more efficient, though really the user/group isn't
using all that space.
Dedup sounds great. I'm really looking forward to that and comstar making their way into the 7000 series.
Thanks for your work.