
Silent Data Corruption Is Real

Here’s something you never want to see:

ZFS has detected a checksum error:

   eid: 138
 class: checksum
  host: alexandria
  time: 2017-01-29 18:08:10-0600
 vtype: disk

This means there was a data error on the drive. But it’s worse than a typical data error: this is an error that the hardware did not detect. ZFS and btrfs write a checksum with every block (both data and metadata) written to the drive, and verify that checksum at read time. Most filesystems don’t do this, because in theory the hardware should detect all errors. In practice, it doesn’t always, which can lead to silent data corruption. That’s why I use ZFS wherever I possibly can.
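
The idea of verifying a checksum on every read can be sketched in a few lines of Python. This is only an illustration, not how ZFS is implemented: the BlockStore class and its method names are hypothetical, and SHA-256 from hashlib stands in for whatever checksum the filesystem is configured to use.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block's contents no longer match its stored checksum."""


class BlockStore:
    """Toy block store that keeps a checksum next to every block (hypothetical)."""

    def __init__(self):
        self._blocks = {}  # block id -> (data, checksum)

    def write(self, block_id, data):
        self._blocks[block_id] = (bytes(data), hashlib.sha256(data).digest())

    def read(self, block_id):
        data, expected = self._blocks[block_id]
        # Verify at read time: the hardware reported success, but we check anyway.
        if hashlib.sha256(data).digest() != expected:
            raise ChecksumError(f"checksum mismatch in block {block_id}")
        return data

    def corrupt(self, block_id):
        """Flip one bit without updating the checksum, as a faulty path might."""
        data, checksum = self._blocks[block_id]
        damaged = bytes([data[0] ^ 0x01]) + data[1:]
        self._blocks[block_id] = (damaged, checksum)
```

A store without the check in read() would hand back the damaged bytes with no complaint, which is exactly what "silent" corruption means.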

As I looked into this issue, I saw that ZFS repaired about 400KB of data. I thought, “well, that was unlucky” and just ignored it.

Then a week later, it happened again. Pretty soon, I noticed it happened every Sunday, and always to the same drive in my pool. As it turns out, the highest I/O load on the machine occurs on Sundays, because I have a cron job that runs zpool scrub then. This operation forces ZFS to read and verify the checksums on every block of data on the drive, and is a nice way to guard against unreadable sectors in rarely-used data.
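
For anyone wanting to script a periodic scrub, here is a minimal sketch. The pool name "tank" used below and the start_scrub helper are assumptions for illustration; the only real commands involved are zpool scrub and zpool status.

```python
import subprocess


def scrub_command(pool):
    """Return the argv list that starts a scrub of the given pool."""
    return ["zpool", "scrub", pool]


def status_command(pool):
    """Return the argv list that reports the pool's health and error counters."""
    return ["zpool", "status", pool]


def start_scrub(pool, run=subprocess.run):
    """Start a scrub; with the default runner, a non-zero exit raises an error.

    The `run` parameter exists only so the function can be exercised
    without a real pool (hypothetical convenience, not part of any ZFS API).
    """
    return run(scrub_command(pool), check=True)
```

A weekly cron entry could call a small script that runs start_scrub("tank"); the scrub itself proceeds in the background, and zpool status shows its progress and any repaired data.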

I finally swapped out the drive, but to my frustration, the new drive exhibited the same issue. The SATA protocol does include a CRC32 checksum, so it seemed (to me, at least) that the problem was unlikely to be a cable or chassis issue. I suspected the motherboard.
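
To see why a link-level CRC made a cable fault seem unlikely, here is a small sketch of how a CRC32 catches a bit flipped in transit. It uses zlib.crc32 and a made-up framing (payload followed by a 4-byte CRC), which is an assumption for illustration and not the actual SATA frame format.

```python
import zlib


def frame(payload):
    """Append a CRC32 of the payload, as a link layer might before sending."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(received):
    """Return True if the trailing CRC32 matches the payload."""
    payload, crc = received[:-4], received[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc


def flip_bit(data, index):
    """Return a copy of data with the lowest bit of one byte flipped."""
    damaged = bytearray(data)
    damaged[index] ^= 0x01
    return bytes(damaged)
```

Note that a check like this only covers the hop over which it is computed. Data damaged before the CRC is generated, or after it has been verified, passes straight through, which is where an end-to-end checksum in the filesystem still helps.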

It so happened I had a 9211-8i SAS card. I had purchased it off eBay a while back when I built the server, but could never get it to see the drives. I wound up not filling the server with as many drives as planned, so the on-board SATA did the trick. Until now.

As I poked at the 9211-8i, noticing that even its configuration utility didn’t see any devices, I finally started wondering if the SAS/SATA breakout cables were a problem. And sure enough – I realized I had a “reverse” cable and needed a “forward” one. $14 later, I had the correct cable and things are working properly now.

One other note: RAM errors can sometimes cause issues like this, but this system uses ECC DRAM and the errors would be unlikely to always manifest themselves on a particular drive.

So over the course of this, had I not been using ZFS, I would have had several megabytes of reads with undetected errors. Thanks to using ZFS, I know my data integrity is still good.

Backing up every few minutes with simplesnap

I’ve written a lot lately about ZFS, and one of its very nice features is the ability to make snapshots that are lightweight, space-efficient, and don’t hurt performance (unlike, say, LVM snapshots).

ZFS also has “zfs send” and “zfs receive” commands that can send the content of the snapshot, or a delta between two snapshots, as a data stream – similar in concept to an amped-up tar file. These can be used to, for instance, very efficiently send backups to another machine. Rather than having to stat() every single file on a filesystem as rsync has to, it effectively sends an intelligent binary delta, one that is also smart about operations such as renames.
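
As a rough sketch of what an incremental transfer looks like when scripted, the following builds and runs the two halves of the pipeline. The dataset, snapshot, and host names are placeholders, and the helper functions are hypothetical; the underlying commands are zfs send -i and zfs receive, piped over ssh.

```python
import subprocess


def incremental_send_command(dataset, old_snap, new_snap):
    """argv for sending only the changes between two snapshots of a dataset."""
    return ["zfs", "send", "-i", f"{dataset}@{old_snap}", f"{dataset}@{new_snap}"]


def remote_receive_command(host, target_dataset):
    """argv for receiving a stream into a dataset on another machine via ssh."""
    return ["ssh", host, "zfs", "receive", target_dataset]


def send_incremental(dataset, old_snap, new_snap, host, target_dataset):
    """Pipe `zfs send -i` into a remote `zfs receive`; True if both succeed."""
    sender = subprocess.Popen(
        incremental_send_command(dataset, old_snap, new_snap),
        stdout=subprocess.PIPE,
    )
    receiver = subprocess.Popen(
        remote_receive_command(host, target_dataset),
        stdin=sender.stdout,
    )
    sender.stdout.close()  # let the sender see a broken pipe if the receiver dies
    receiver_status = receiver.wait()
    sender_status = sender.wait()
    return sender_status == 0 and receiver_status == 0
```

For example, send_incremental("tank/home", "monday", "tuesday", "backuphost", "backup/home") would transfer only what changed between the two snapshots, assuming the receiving side already holds the "monday" snapshot.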

Since my last search for backup tools, I’d been using BackupPC for my personal systems. But since I switched them to ZFS on Linux, I’ve been wanting to try something better.

There are a lot of tools out there to take ZFS snapshots and send them to another machine, and I summarized them on my wiki. I found zfSnap to work well for taking and rotating snapshots, but I didn’t find anything that matched my criteria for sending them across the network. It seemed par for the course for these tools to think nothing of opening up full root access to a machine from others, whereas I would much rather lock it down with command= in authorized_keys.
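
To make the command= idea concrete, here is a small sketch that builds a locked-down authorized_keys line. The wrapper path in the example is a placeholder, not simplesnap’s actual configuration; the options themselves (command=, no-port-forwarding, no-X11-forwarding, no-agent-forwarding, no-pty) are standard OpenSSH authorized_keys options.

```python
def restricted_key_line(public_key, forced_command):
    """Build an authorized_keys line that only permits one forced command."""
    if '"' in forced_command:
        # Keep the sketch simple: refuse rather than attempt quoting rules.
        raise ValueError("forced command must not contain double quotes")
    options = [
        f'command="{forced_command}"',
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-agent-forwarding",
        "no-pty",
    ]
    return ",".join(options) + " " + public_key.strip()
```

With a line like this in place, whatever the connecting machine asks to run, sshd runs the forced command instead, so a compromised backup host cannot get a root shell on the machine being backed up.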

So I wrote my own, called simplesnap. As usual, I wrote extensive documentation for it as well, even though it is very simple to set up and use.

So, with BackupPC, a backup of my workstation took almost 8 hours. (Its “incremental” might take as few as 3 hours.) With ZFS snapshots and simplesnap, it takes 25 seconds. 25 seconds!

So right now, instead of backing up once a day, I back up once an hour. There’s no reason I couldn’t back up every 5 minutes, in fact. The data consumes less space, is far faster to manage, and doesn’t require a nightly hours-long cleanup process like BackupPC does — zfs destroy on a snapshot just takes a few seconds.

I use a pair of USB disks for backups, and rotate them to offsite storage periodically. They simply run ZFS atop dm-crypt (for security) and it works quite well even on those slow devices.

Although ZFS doesn’t do file-level dedup like BackupPC does, and the lz4 compression I’ve set ZFS to use is less efficient than the gzip-like compression BackupPC uses, still the backups are more space-efficient. I am not quite sure why, but I suspect it’s because there is a lot less metadata to keep track of, and perhaps also because BackupPC has to store a new copy of a file if even a byte changes, whereas ZFS can store just the changed blocks.
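
The difference between storing a whole new file and storing only changed blocks can be illustrated with a toy comparison. The 4096-byte block size and the changed_blocks helper are assumptions made for the example, not ZFS internals.

```python
def changed_blocks(old, new, block_size=4096):
    """Return indexes of fixed-size blocks that differ between two byte strings."""
    count = (max(len(old), len(new)) + block_size - 1) // block_size
    return [
        i
        for i in range(count)
        if old[i * block_size:(i + 1) * block_size]
        != new[i * block_size:(i + 1) * block_size]
    ]


def block_level_cost(old, new, block_size=4096):
    """Bytes a block-level store would need to keep for the new version."""
    return len(changed_blocks(old, new, block_size)) * block_size


def file_level_cost(old, new):
    """Bytes a whole-file store would need: the full file if anything changed."""
    return len(new) if old != new else 0
```

For a 1 MiB file with a single byte modified, the file-level approach stores another full megabyte while the block-level approach stores one 4 KiB block.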

Incidentally, I’ve packaged both zfSnap and simplesnap for Debian and both are waiting in NEW.

Why and how to run ZFS on Linux

I’m writing a bit about ZFS these days, and I thought I’d write a bit about why I am using it, why it might or might not be interesting for you, and what you might do about it.

ZFS Features and Background

ZFS is not just a filesystem in the traditional sense, though you can use it that way. It is an integrated storage stack, which can completely replace LVM, md-raid, and even hardware RAID controllers. This permits quite a bit of flexibility and optimization not present when building a stack from those components. For instance, if a drive in a RAID fails, ZFS needs to rebuild only the parts that actually have data stored on them.

Let’s look at some of the features of ZFS:

  • Full checksumming of all data and metadata, providing protection against silent data corruption. The only other Linux filesystem to offer this is btrfs.
  • ZFS is a transactional filesystem that ensures consistent data and metadata.
  • ZFS is copy-on-write, with snapshots that are cheap to create and impose virtually undetectable performance hits. Compare to LVM snapshots, which make writes notoriously slow and require an fsck and mount to get to a readable point.
  • ZFS supports easy rollback to previous snapshots.
  • ZFS send/receive can perform incremental backups much faster than rsync, particularly on systems with many unmodified files. Since it works from snapshots, it guarantees a consistent point-in-time image as well.
  • Snapshots can be turned into writeable “clones”, which simply use copy-on-write semantics. It’s like a cp -r that completes almost instantly and takes no space until you change it.
  • The datasets (“filesystems” or “logical volumes” in LVM terms) in a zpool (“volume group”, to use LVM terms) can shrink or grow dynamically. They can have individual maximum and minimum sizes set, but they can also use any space available in the pool. With LVM, by contrast, if, say, /usr gets bigger than you thought, you have to manually allocate more space to it.
  • ZFS is designed to run well in big iron, and scales to massive amounts of storage. It supports SSDs as L2ARC (second-level cache) and ZIL (intent log) devices.
  • ZFS has some built-in compression methods that are quite CPU-efficient and can yield not just space but performance benefits in almost all cases involving compressible data.
  • ZFS pools can host zvols, a block device under /dev that stores its data in the zpool. zvols support TRIM/DISCARD, so are ideal for storing VM images, as they can instantly release space released by the guest OS. They can also be snapshotted and backed up like the rest of ZFS.
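
Why copy-on-write snapshots and clones are nearly free can be shown with a toy model. This is a deliberately simplified sketch with hypothetical Block and Dataset classes; real ZFS does not copy a block table when snapshotting, but the sharing behavior is the point here.

```python
class Block:
    """One chunk of data; object identity tells us whether it is shared."""

    def __init__(self, data):
        self.data = data


class Dataset:
    """Toy copy-on-write dataset: a table mapping block numbers to blocks."""

    def __init__(self, table=None):
        self.table = dict(table or {})

    def write(self, index, data):
        # Never modify a block in place: allocate a new one and repoint.
        self.table[index] = Block(data)

    def read(self, index):
        return self.table[index].data

    def snapshot(self):
        # Only references are copied; the data blocks themselves are shared.
        return Dataset(self.table)


def shared_blocks(a, b):
    """Count blocks that two datasets hold in common (no extra space used)."""
    return sum(1 for index, block in a.table.items() if b.table.get(index) is block)
```

Right after a snapshot every block is shared, so the snapshot costs no data space. Space is consumed only as the live dataset diverges, and the snapshot keeps returning the old contents.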

Although it is often considered a server filesystem, ZFS has been used in plenty of other situations for some time now, with ports to FreeBSD, Linux, and MacOS. I find it particularly useful:

  • To have faith that my photos, backups, and paperwork archives are intact. zpool scrub at any time will read the entire dataset and verify the integrity of every bit.
  • To create snapshots of my system before running apt-get dist-upgrade, making it easy to track down issues or roll back to a known-good configuration. Ideal for people tracking sid or testing. One can also simply boot from a previous snapshot.
  • To protect work in progress against an accidental rm. Many scripts exist that make frequent snapshots and retain them for a period of time. There is no reason not to snapshot /home every 5 minutes, for instance. It’s almost as good as storing / in git.
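
The "snapshot often, keep for a while" pattern boils down to a retention rule. Here is a minimal sketch, assuming a hypothetical naming scheme of auto-YYYYMMDD-HHMMSS for automatic snapshots; it is not the format used by zfSnap or any other particular tool. Snapshots with other names are left alone.

```python
from datetime import datetime, timedelta

SNAPSHOT_FORMAT = "auto-%Y%m%d-%H%M%S"  # hypothetical naming scheme


def snapshot_name(now):
    """Name for an automatic snapshot taken at the given time."""
    return now.strftime(SNAPSHOT_FORMAT)


def expired_snapshots(names, now, keep_for):
    """Return automatic snapshot names older than keep_for, oldest first."""
    expired = []
    for name in names:
        try:
            taken = datetime.strptime(name, SNAPSHOT_FORMAT)
        except ValueError:
            continue  # not an automatic snapshot; never touch manual ones
        if now - taken > keep_for:
            expired.append((taken, name))
    return [name for taken, name in sorted(expired)]


def destroy_command(dataset, name):
    """argv that removes one snapshot of a dataset."""
    return ["zfs", "destroy", f"{dataset}@{name}"]
```

A script run every 5 minutes could take a snapshot named by snapshot_name() and then run destroy_command() for each entry returned by expired_snapshots().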

The added level of security in having cheap snapshots available is almost worth it by itself.

ZFS drawbacks

Compared to other Linux filesystems, there are a few drawbacks of ZFS:

  • Its CDDL license will prevent it from ever being part of the mainline Linux kernel tree.
  • It is more RAM-hungry than most, although with tuning it can even run on the Raspberry Pi.
  • A 64-bit kernel is strongly preferred, even in low-memory situations.
  • Performance on many small files may be worse than with ext4.
  • The ZFS cache does not shrink and expand in response to changing RAM usage conditions on the system as well as the normal Linux cache does.
  • ZFS lacks some features of btrfs, such as being able to shrink an existing pool or easily change storage allocation on the fly. On the other hand, the features in ZFS have never caused me a kernel panic, and half the things I liked about btrfs seem to have.
  • ZFS is already quite stable on Linux. However, the GRUB, init, and initramfs code supporting booting from a ZFS root and /boot is less stable. If you want to go 100% ZFS, be prepared to tweak your system to get it to boot properly. Once done, however, it is quite stable.

Converting to ZFS

I have written up an extensive HOWTO on converting an existing system to use ZFS. It covers workarounds for all the boot-time bugs I have encountered as well as documenting all steps needed to make it happen. It works quite well.

Additional Hints

If setting up zvols to be used by VirtualBox or some such system, you might be interested in managing zvol ownership and permissions with udev.

Debian-Live Rescue image with ZFS On Linux; Ditched btrfs

I’m a geek. I enjoy playing with different filesystems, version control systems, and, well, for that matter, radios.

I have lately started to worry about the risks of silent data corruption, and as such, looked to switch my personal systems to either ZFS or btrfs, both of which offer built-in checksumming of all data and metadata. I initially opted for btrfs, because of its tighter integration into the Linux kernel and ability to shrink an existing btrfs filesystem.

However, as I wrote last month, that experiment was not a success. I had too many serious performance regressions and one too many kernel panics and decided it wasn’t worth it. I also think the SuSE people got it wrong, deeply wrong, when they declared btrfs ready for production. To its credit, I never lost any data. But it simply reduces uptime too much.

That left ZFS. Before I build a system, I always want to make sure I can repair it. So I started with the Debian Live rescue image, and added the zfsonlinux.org repository to it, along with some key packages to enable the ZFS kernel modules, GRUB support, and initramfs support. The resulting image is described, and can be downloaded from, my ZFS Rescue Disc wiki page, which also has a link to my source tree on github.

In future blog posts in the series, I will describe the process of converting existing Debian installations to use ZFS, of getting them to boot from ZFS, some bugs I encountered along the way, and some surprising performance regressions in ZFS compared to ext4 and btrfs.

Results with btrfs and zfs

The recent news that openSUSE considers btrfs safe for users prompted me to consider using it. And indeed I did. I was already familiar with zfs, so considered this a good opportunity to experiment with btrfs.

btrfs makes an intriguing filesystem for all sorts of workloads. The benefits of btrfs and zfs are well-documented elsewhere. There are a number of features btrfs has that zfs lacks. For instance:

  • The ability to shrink a device that’s a member of a filesystem/pool
  • The ability to remove a device from a filesystem/pool entirely, assuming enough free space exists elsewhere for its data to be moved over.
  • Asynchronous deduplication that imposes neither a synchronous performance hit nor a heavy RAM burden
  • Copy-on-write copies down to the individual file level with cp --reflink
  • Live conversion of data between different profiles (single, dup, RAID0, RAID1, etc)
  • Live conversion between on-the-fly compression methods, including none at all
  • Numerous SSD optimizations, including alignment and both synchronous and asynchronous TRIM options
  • Proper integration with the VM subsystem
  • Proper support across the many Linux architectures, including 32-bit ones (zfs is currently only flagged stable on amd64)
  • Does not require excessive amounts of RAM

The feature set of ZFS that btrfs lacks is well-documented elsewhere, but there are a few odd btrfs missteps:

  • There is no way to see how much space a subvolume/filesystem is using without turning on quotas. Even then, it is cumbersome and not reported with df like it should be.
  • When a maximum size for a subvolume is set via a quota, it is not reported via df; applications have no idea when they are about to hit the maximum size of a filesystem.

btrfs would be fine if it worked reliably. I should say at the outset that I have never lost any data due to it, but it has caused enough kernel panics that I’ve lost count. Several times I had a file that produced a panic when I tried to delete it; several times it took more than 12 hours to unmount a btrfs filesystem; and hardlink-heavy workloads took days longer to complete than on zfs or ext4. And those are just the problems I wrote about. I tried to use btrfs balance to change the metadata allocation on the filesystem, and never did get it to complete; it seemed to go into an endless I/O pattern after the first 1GB of metadata and never got past that. I didn’t bother trying the live migration of data from one disk to another on this filesystem.

I wanted btrfs to work. I really, really did. But I just can’t see it working. I tried it on my laptop, but had to turn off CoW on my virtual machine’s disk because of the rm bug. I tried it on my backup devices, but it was unusable there due to being so slow. (Also, the hardlink behavior is broken by default and requires btrfstune -r. Yipe.)

At this point, I don’t think it is really all that worth bothering with. I think the SuSE decision is misguided and ill-informed. btrfs will be an awesome filesystem. I am quite sure it will, and will in time probably displace zfs as the most advanced filesystem out there. But that time is not yet here.

In the meantime, I’m going to build a Debian Live Rescue CD with zfsonlinux on it. Because I don’t ever set up a system I can’t repair.

rdiff-backup, ZFS, and rsync scripts

rdiff-backup vs. ZFS

As I’ve been writing about backups, I’ve gone ahead and run some tests with rdiff-backup. I have been using rdiff-backup personally for many years now — probably since 2002, when I packaged it up for Debian. It’s a nice, stable system, but I always like to look at other options for things every so often.

rdiff-backup stores an uncompressed current mirror of the filesystem, similar to rsync. History is achieved by the use of compressed backwards binary deltas generated by rdiff (using the rsync algorithm). So, you can restore the current copy very easily — a simple cp will do if you don’t need to preserve permissions. rdiff-backup restores previous copies by applying all necessary binary deltas to generate the previous version.
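
The "current mirror plus backwards deltas" layout can be sketched with plain dictionaries. This is a simplified model under stated assumptions: a tree is a dict of path to content, and each reverse delta stores whole old file contents, whereas rdiff-backup stores compressed binary deltas. The function names are hypothetical.

```python
def make_reverse_delta(old, new):
    """Record what must change to turn the newer tree back into the older one."""
    delta = {}
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):
            delta[path] = old.get(path)  # None means "did not exist before"
    return delta


def apply_reverse_delta(tree, delta):
    """Return the older tree obtained by applying one reverse delta."""
    older = dict(tree)
    for path, content in delta.items():
        if content is None:
            older.pop(path, None)
        else:
            older[path] = content
    return older


def restore(mirror, reverse_deltas, steps_back):
    """Walk back steps_back versions from the current mirror.

    reverse_deltas is ordered newest first.
    """
    tree = dict(mirror)
    for delta in reverse_deltas[:steps_back]:
        tree = apply_reverse_delta(tree, delta)
    return tree
```

Restoring the current version needs no deltas at all, and the oldest history can be dropped simply by discarding the tail of the delta list, since newer versions never depend on it.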

Things I like about rdiff-backup:

  1. Bandwidth-efficient
  2. Reasonably space-efficient, especially where history is concerned
  3. Easily scriptable and nice CLI
  4. Unlike tools such as duplicity, there is no need to periodically run full backups — old backups can be deleted without impacting the ability to restore more current backups

Things I don’t like about it:

  1. Speed. It can be really slow. Deleting 3 months’ worth of old history takes hours. It has to unlink vast numbers of files — and that’s pretty much it, but it does it really slowly. Restores, backups, etc. are all slow as well. Even just getting a list of your increment sizes so you’d know how much space would be saved can take a very long time.
  2. The current backup copy is stored without any kind of compression, which is not at all space-efficient
  3. It creates vast numbers of little files that take forever to delete or summarize

So I thought I would examine how efficient ZFS would be. I wrote a script that would replay the rdiff-backup history — first it would rsync the current copy onto the ZFS filesystem and make a ZFS snapshot. Then each previous version was processed by my script (rdiff-backup’s files are sufficiently standard that a shell script can process them), and a ZFS snapshot created after each. This lets me directly compare the space used by rdiff-backup to that used by ZFS using actual history.

I enabled gzip-3 compression and block dedup in ZFS.

My backups were nearly 1TB in size and the amount of space I had available for ZFS was roughly 600GB, so I couldn’t test all of them. As it happened, I tested the ones that were the worst-case scenario for ZFS: my photos, music collection, etc. These files had very little duplication and very little compressibility. Plus a backup of my regular server that was reasonably compressible.

The total size of the data backed up with rdiff-backup was 583 GB. With ZFS, this came to 498GB. My dedup ratio on this was only 1.05 (meaning 5% or 25GB saved). The compression ratio was 1.12 (60GB saved). The combined ratio was 1.17 (85GB saved). Interestingly 498 + 85 = 583.
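
The arithmetic behind those figures is easy to check. In this sketch the sizes and ratios are the ones quoted above; the function names are just for illustration.

```python
RDIFF_BACKUP_GB = 583   # total size under rdiff-backup (from the post)
ZFS_GB = 498            # space used on ZFS (from the post)


def saved_gb(stored_gb, ratio):
    """Space that a ratio of `ratio`:1 saves, given what is actually stored."""
    return stored_gb * (ratio - 1)


def combined_ratio(logical_gb, stored_gb):
    """Overall ratio between the original size and the space actually used."""
    return logical_gb / stored_gb
```

Here saved_gb(ZFS_GB, 1.05) rounds to 25 and saved_gb(ZFS_GB, 1.12) rounds to 60, while combined_ratio(RDIFF_BACKUP_GB, ZFS_GB) rounds to 1.17, matching the numbers reported.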

Remember that the data under test here was mostly a worst-case scenario for ZFS. It would probably have done better had I had the time to throw the rest of my dataset at it (such as the 60GB backup of my iPod, which would have mostly deduplicated with the backup of my music server).

One problem with ZFS is that dedup is very memory-hungry. This is common knowledge and it is advertised that you need to use roughly 2GB of RAM per TB of disk when using dedup. I don’t have quite that much to dedicate to it, so ZFS got VERY slow and thrashed the disk a lot after the ARC grew to about 300MB. I found some tweakables in zfsrc and the zfs command that let me tweak the ARC cache to grow bigger. But the machine in question only has 2GB RAM, and is doing lots of other things as well, so this barely improved anything. Note that this dedup RAM requirement is not out of line with what is expected from these sorts of solutions.

Even if I got an absolutely stellar dedup ratio of 2:1, that would get me at most 1TB. The cost of buying a 1TB disk is less than the cost of upgrading my system to 4GB RAM, so dedup isn’t worth it here.
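
The trade-off can be written down as a small calculation. The 2GB-per-TB figure is the rule of thumb mentioned above; the prices in the usage example are invented placeholders, not real quotes.

```python
def dedup_ram_gb(pool_tb, gb_per_tb=2.0):
    """RAM suggested for dedup, using the rough 2GB-per-TB rule of thumb."""
    return pool_tb * gb_per_tb


def extra_space_tb(pool_tb, dedup_ratio):
    """Additional data that fits in the pool thanks to a given dedup ratio."""
    return pool_tb * (dedup_ratio - 1)


def cheaper_option(pool_tb, dedup_ratio, ram_upgrade_cost, disk_cost_per_tb):
    """Compare buying the space dedup would free against buying the RAM."""
    disk_cost = extra_space_tb(pool_tb, dedup_ratio) * disk_cost_per_tb
    return "add disk" if disk_cost <= ram_upgrade_cost else "add RAM"
```

For a 1TB pool, cheaper_option(1, 2.0, ram_upgrade_cost=100, disk_cost_per_tb=80) returns "add disk" even at the optimistic 2:1 ratio; at the measured 1.05 the space gained is only about 0.05TB, so the case for dedup is weaker still.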

I think the lesson is: think carefully about where dedup makes sense. If you’re storing a bunch of nearly-identical virtual machine images — the sort of canonical use case for this — go for it. A general fileserver — well, maybe you should just add more disk instead of more RAM.

Then that raises the question: if I don’t need dedup from ZFS, do I bother with it at all, or just use ext4 and LVM snapshots? I think ZFS still makes sense, given its built-in support for compression and very fast snapshots. LVM snapshots are known to cause serious degradation to write performance once enabled, which ZFS snapshots don’t.

So I plan to switch my backups to use ZFS. A few observations on this:

  1. Some testing suggests that the time to delete a few months of old snapshots will be a minute or two with ZFS compared to hours with rdiff-backup.
  2. ZFS has shown itself to be more space-efficient than rdiff-backup, even without dedup enabled.
  3. There are clear performance and convenience wins with ZFS.

Backup Scripts

So now comes the question of backup scripts. rsync is obviously a pretty nice choice here, and if used with --inplace it may even play nicely with ZFS snapshots when dedup is off. But let’s say I’m backing up a few machines at home, or perhaps dozens at work. There is a need to automate all of this. Specifically, there’s a need to:

  1. Provide scheduling, making sure that we don’t hammer the server with 30 clients all at once
  2. Provide for “run before” jobs to do things like snapshot databases
  3. Be silent on success and scream loudly via emails to administrators on any kind of error… and keep backing up other systems when there is an error
  4. Create snapshots and provide an automated way to remove old snapshots (or mount them for reading, as ZFS-fuse doesn’t support the .zfs snapshot directory yet)

To date I haven’t found anything that looks suitable. I found a shell script system called rsbackup that does a large part of this, but something about using a script whose homepage is a forum makes me less than 100% confident.

On the backup-security front, rsync comes with a good-looking rrsync script (inexplicably installed under /usr/share/doc/rsync/scripts instead of /usr/bin on Debian) that can help secure the SSH authorization. GNU rush also looks like a useful restricted shell.
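
The scheduling and error-handling requirements listed above can be sketched as a small driver loop. This is a minimal illustration, not an existing tool: run_backups and its backup, pre_job, and notify hooks are hypothetical, and a real setup would pass in functions that call rsync and send mail.

```python
def run_backups(clients, backup, pre_job=None, notify=print):
    """Back up clients one at a time; report failures but keep going.

    backup and pre_job are callables taking a client name. notify is called
    once, and only if something failed, so a clean run stays silent.
    """
    failures = {}
    for client in clients:  # sequential, so clients never hit the server at once
        try:
            if pre_job is not None:
                pre_job(client)  # e.g. snapshot a database before copying
            backup(client)
        except Exception as exc:
            failures[client] = str(exc)  # remember the error, move to next client
    if failures:
        lines = [f"{client}: {error}" for client, error in failures.items()]
        notify("Backup failures:\n" + "\n".join(lines))
    return failures
```

Snapshot creation and expiry would run after each successful backup(client) call; they are left out here to keep the control flow visible.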

Trying out XFS

I’ve used most of the different filesystems in Linux. My most recent favorite has been JFS, but things like starvation with find have really been annoying me lately. To summarize, here is my experience with filesystems:

  • ext2: very slow, moderately unreliable
  • ext3: somewhat slow but reliable
  • reiserfs: fast, unreliable (cross-linked data after crash issues)
  • jfs: usually fast, somewhat unreliable (similar issues after crash, plus weird charset issues)

The one major Linux FS not in that list is XFS. So I decided to give it a whirl, switching my 40GB /home on one machine to XFS. So far, it’s been good.

There are two articles at IBM developerWorks about XFS that were useful. There’s also a useful filesystems comparison from Novell.