Tag Archives: zfs

Pipes, deadlocks, and strace annoyingly fixing them

This is a complex tale I will attempt to make simple(ish). I’ve (re)learned more than I cared to about the details of pipes, signals, and certain system calls – and the solution is still elusive.

For some time now, I have been using NNCP to back up my files. These backups are sent to my backup system, which effectively does this to process them (each ZFS send is piped to a shell script that winds up running this):

gpg -q -d | zstdcat -T0 | zfs receive -u -o readonly=on "$STORE/$DEST"

This processes tens of thousands of zfs sends per week. Recently, having written Filespooler, I switched to sending the backups using Filespooler over NNCP. Now fspl (the Filespooler executable) opens the file for each stream and then connects it to what amounts to this pipeline:

bash -c 'gpg -q -d 2>/dev/null | zstdcat -T0' | zfs receive -u -o readonly=on "$STORE/$DEST"

Actually, to be more precise, it spins up the bash part of it, reads a few bytes from it, and then connects it to the zfs receive.

And this works well — almost always. In something like 1/1000 of the cases, it deadlocks, and I still don’t know why. But I can talk about the journey of trying to figure it out (and maybe some of you will have some ideas).

Filespooler is written in Rust, and uses Rust’s Command system. Effectively what happens is this:

  1. The fspl process has a File handle which, after forking but before invoking bash, it dup2’s to stdin.
  2. The connection between bash and zfs receive is a standard Unix pipe.

I cannot get the problem to duplicate when I run the entire thing under strace -f. So I am left trying to peek at it from the outside. What happens if I try to attach to each component with strace -p?

  • bash is blocking in wait4(), which is expected.
  • gpg is blocking in write().
  • If I attach to zstdcat with strace -p, then all of a sudden the deadlock is cleared and everything resumes and completes normally.
  • Attaching to zfs receive with strace -p causes no output at all from strace for a few seconds, then zfs just writes “cannot receive incremental stream: incomplete stream” and exits with error code 1.

So the plot thickens! Why would connecting to zstdcat and zfs receive cause them to actually change behavior? strace works by using the ptrace system call, and ptrace in a number of cases requires sending SIGSTOP to a process. In a complicated set of circumstances, a system call may return EINTR when a SIGSTOP is received, with the idea that the system call should be retried. I can’t see, from either zstdcat or zfs, if this is happening, though.

So I thought, “how about having Filespooler manually copy data from bash to zfs receive in a read/write loop instead of having them connected directly via a pipe?” That is, there would be two pipes going there: one where Filespooler reads from the bash command, and one where it writes to zfs. If nothing else, I could instrument it with debugging.

And so I did, and I found that when it deadlocked, it was deadlocking on write — but with no discernible pattern as to where or when. So I went back to the directly-connected pipe.

In analyzing straces, I found (and reported) a Rust bug in which it fails to close the read end of a pipe in the parent post-fork. However, implementing a workaround for this didn’t prevent the deadlock, so it is orthogonal to the issue at hand.

The two strange things here are that things return to normal when I attach strace to zstdcat, and that things crash when I attach strace to zfs. I decided to investigate the latter.

It turns out that the ZFS code that is reading from stdin during zfs receive is in the kernel module, not userland. Here is the part that is triggering the “incomplete stream” error:

                int err = zfs_file_read(fp, (char *)buf + done,
                    len - done, &resid);
                if (resid == len - done) {
                        /*
                         * Note: ECKSUM or ZFS_ERR_STREAM_TRUNCATED indicates
                         * that the receive was interrupted and can
                         * potentially be resumed.
                         */
                        err = SET_ERROR(ZFS_ERR_STREAM_TRUNCATED);
                }

resid is an output parameter with the number of bytes remaining from a short read, so in this case, if the read produced zero bytes, then it sets that error. What’s zfs_file_read then?

It boils down to a thin wrapper around kernel_read(). This winds up calling __kernel_read(), which calls read_iter on the pipe, which is pipe_read(). That’s where I don’t have the knowledge to get into the weeds right now.

So it seems likely to me that the problem has something to do with zfs receive. But what, exactly? Why does it fail only in this one very specific situation, and only so rarely? And why does attaching strace to zstdcat make it all work again? I’m indeed puzzled!

Update 2022-06-20: See the followup post which identifies this as likely a kernel bug and explains why this particular use of Filespooler made it easier to trigger.

More Topics on Store-And-Forward (Possibly Airgapped) ZFS and Non-ZFS Backups with NNCP

Note: this is another article in my series on asynchronous communication in Linux with UUCP and NNCP.

In my previous post, I introduced a way to use ZFS backups over NNCP. In this post, I’ll expand on that and also explore non-ZFS backups.

Use of nncp-file instead of nncp-exec

The previous example used nncp-exec (like UUCP’s uux), which lets you pipe stdin in, then queues up a request to run a given command with that input on a remote. I discussed that NNCP doesn’t guarantee order of execution, but that for the ZFS use case, that was fine since zfs receive would just fail (causing NNCP to try again later).

At present, nncp-exec stores the data piped to it in RAM before generating the outbound packet (the author plans to fix this shortly). [Update: This is now fixed; use -use-tmp with nncp-exec!] That made it unusable for some of my backups, so I set it up another way: with nncp-file, the tool to transfer files to a remote machine. A cron job then picks them up and processes them.

On the machine being backed up, we have to find a way to encode the dataset to be received. I chose to do that as part of the filename, so the updated simplesnap-queue could look like this:

#!/bin/bash

set -e
set -o pipefail

DEST="`echo $1 | sed 's,^tank/simplesnap/,,'`"
FILE="bakfsfmt2-`date "+%s.%N".$$`_`echo "$DEST" | sed 's,/,@,g'`"

echo "Processing $DEST to $FILE" >&2
# stdin piped to this
zstd -8 - \
  | gpg --compress-algo none --cipher-algo AES256 -e -r 012345...  \
  | su nncp -c "/usr/local/nncp/bin/nncp-file -nice B -noprogress - 'backupsvr:$FILE'" >&2

echo "Queued $DEST to $FILE" >&2

I’ve added compression and encryption here as well; more on that below.

On the backup server, we would define a different incoming directory for each node in nncp.hjson. For instance:

host1: {
...
   incoming: "/var/local/nncp-backups-incoming/host1"
}

host2: {
...
   incoming: "/var/local/nncp-backups-incoming/host2"
}

I’ll present the scanning script in a bit.

Offsite Backup Rotation

Most of the time, you don’t want just a single drive to store the backups. You’d like to have a set. At minimum, one wouldn’t be plugged in so lightning wouldn’t ruin all your backups. But maybe you’d store a second drive at some other location you have access to (friend’s house, bank box, etc.)

There are several ways you could solve this:

  • If the remote machine is at a location with network access and you trust its physical security (remember that although it will store data encrypted at rest and will transport it encrypted, it will — in most cases — handle un-encrypted data during processing), you could of course send NNCP packets to it over the network at the same time you send them to your local backup system.
  • Alternatively, if the remote location doesn’t have network access or you want to keep it airgapped, you could transport the NNCP packets by USB drive to the remote end.
  • Or, if you don’t want to have any kind of processing capability remotely — probably a wise move — you could rotate the hard drives themselves, keeping one plugged in locally and unplugging the other to take it offsite.

The third option can be helped with NNCP, too. One way is to create separate NNCP installations for each of the drives that you store data on. Then, whenever one is plugged in, the appropriate NNCP config will be loaded and appropriate packets received and processed. The neighbor machine — the spooler — would just store up packets for the offsite drive until it comes back onsite (or, perhaps, your airgapped USB transport would do this). Then when it’s back onsite, all the queued up ZFS sends get replayed and the backups replicated.
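As a minimal sketch of how that per-drive selection might look (the mount points and config paths here are assumptions, not from my actual setup), a small wrapper on the backup system can pick whichever drive’s NNCP configuration is currently mounted and run the toss against it:

#!/bin/bash
# Sketch: use the NNCP config that lives on whichever backup drive is mounted.
# Mount points and config paths are illustrative only.
for CFG in /bakfs1/nncp/nncp.hjson /bakfs2/nncp/nncp.hjson; do
    if [ -f "$CFG" ]; then
        exec su nncp -c "/usr/local/nncp/bin/nncp-toss -cfg '$CFG' -noprogress"
    fi
done
echo "No backup drive with an NNCP config appears to be mounted" >&2
exit 1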

Now, how might you handle this with NNCP?

The simple way would be to have each system generating backups send them to two destinations. For instance:

zstd -8 - | gpg --compress-algo none --cipher-algo AES256 -e -r 07D5794CD900FAF1D30B03AC3D13151E5039C9D5 \
  | tee >(su nncp -c "/usr/local/nncp/bin/nncp-file -nice B+5 -noprogress - 'backupdisk1:$FILE'") \
        >(su nncp -c "/usr/local/nncp/bin/nncp-file -nice B+5 -noprogress - 'backupdisk2:$FILE'") \
   > /dev/null

You could probably also more safely use pee(1) (from moreutils) to do this.
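For reference, a pee-based variant might look something like this (untested sketch; pee hands each argument to sh -c, so the quoting gets a bit involved):

zstd -8 - | gpg --compress-algo none --cipher-algo AES256 -e -r 07D5794CD900FAF1D30B03AC3D13151E5039C9D5 \
  | pee "su nncp -c \"/usr/local/nncp/bin/nncp-file -nice B+5 -noprogress - 'backupdisk1:$FILE'\"" \
        "su nncp -c \"/usr/local/nncp/bin/nncp-file -nice B+5 -noprogress - 'backupdisk2:$FILE'\""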

This has the unfortunate result of doubling the network traffic from every machine being backed up. So an alternative option would be to queue the packets to the spooling machine, and run a distribution script from it; something like this, in part:

INCOMINGDIR="/var/local/nncp-bakfs-incoming"
LOCKFILE="$INCOMINGDIR/.lock"
printf -v EVAL_SAFE_LOCKFILE '%q' "$LOCKFILE"
if dotlockfile -r 0 -l -p "${LOCKFILE}"; then
  logit "Lock obtained at ${LOCKFILE} with dotlockfile"
  trap 'ECODE=$?; dotlockfile -u '"${EVAL_SAFE_LOCKFILE}"'; exit $ECODE' EXIT INT TERM
else
  logit "Could not obtain lock at $LOCKFILE; $0 likely already running."
  exit 0
fi


logit "Scanning queue directory..."
cd "$INCOMINGDIR"
for HOST in *; do
   cd "$INCOMINGDIR/$HOST"
   for FILE in bakfsfmt2-*; do
           if [ -f "$FILE" ]; then
                   for BAKFS in backupdisk1 backupdisk2; do
                           runcommand nncp-file -nice B+5 -noprogress "$FILE" "$BAKFS:$HOST/$FILE"
                   done
                   runcommand rm "$FILE"
           else
                   logit "$HOST: Skipping $FILE since it doesn't exist"
           fi
   done

done
logit "Scan complete."

Security Considerations

You’ll notice that in my example above, the encryption happens as the root user, but nncp is called under su. This means that even if there is a vulnerability in NNCP, the data would still be protected by GPG. I’ll also note here that many sites run ssh as root unnecessarily; the same principles should apply there. (ssh has had vulnerabilities in the past as well.) I could have used gpg’s built-in compression, but zstd is faster and better, so we get good performance by using fast compression and piping the result to gpg, which can use hardware-accelerated AES for encryption.

I strongly encourage considering transport, whether ssh or NNCP or UUCP, to be untrusted. Don’t run it as root if you can avoid it. In my example, the nncp user, which all NNCP commands are run as, has no access to the backup data at all. So even if NNCP were compromised, my backup data wouldn’t be. For even more security, I could also sign the backup stream with gpg and validate that on the receiving end.

I should note, however, that this conversation assumes that a network- or USB-facing ssh or NNCP is more likely to have an exploitable vulnerability than is gpg (which here is just processing a stream). This is probably a safe assumption in general. If you believe gpg is more likely to have an exploitable vulnerability than ssh or NNCP, then obviously you wouldn’t take this particular approach.

On the zfs side, I avoid the use of -F with zfs receive; with it, a compromised backed-up machine could generate a malicious rollback on the destination. Backup zpools should be imported with -R or -N to ensure that a malicious mountpoint property couldn’t be used to cause an attack. I choose to use “zfs receive -u -o readonly=on”, which works whether the backup datasets are left unmounted, the zpool is imported with -R, or both. To access the data in a backup dataset, you would normally clone it and access it there.
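For illustration, here is roughly what that looks like in practice (the pool, dataset, and snapshot names are placeholders):

# Import the backup pool with an alternate root, so that no mountpoint property
# received from a backed-up machine can overlay a path on the backup server:
zpool import -R /backups bakpool

# To look inside a received (read-only, unmounted) backup dataset, clone a
# snapshot of it and mount the clone instead of the dataset itself:
zfs clone -o readonly=off -o mountpoint=/mnt/restore \
    bakpool/simplesnap/host1/home@some-snapshot bakpool/restore-host1-home
# (Because of -R above, the clone will actually appear under /backups/mnt/restore.)

# When done, throw the clone away:
zfs destroy bakpool/restore-host1-home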

The processing script

So, let’s put this all together and look at an example of a processing script that would run from cron as root and process the incoming ZFS data.

#!/bin/bash
set -e
set -o pipefail

# Log a message
logit () {
   logger -p info -t "`basename "$0"`[$$]" "$1"
}

# Log an error message
logerror () {
   logger -p err -t "`basename "$0"`[$$]" "$1"
}

# Log stdin with the given code.  Used normally to log stderr.
logstdin () {
   logger -p info -t "`basename "$0"`[$$/$1]"
}

# Run command, logging stderr and exit code
runcommand () {
   logit "Running $*"
   if "$@" 2> >(logstdin "$1") ; then
      logit "$1 exited successfully"
      return 0
   else
       RETVAL="$?"
       logerror "$1 exited with error $RETVAL"
       return "$RETVAL"
   fi
}

STORE=backups/simplesnap
INCOMINGDIR=/backups/nncp/incoming

if ! [ -d "$INCOMINGDIR" ]; then
        logerror "$INCOMINGDIR doesn't exist"
        exit 0
fi

LOCKFILE="/backups/nncp/.nncp-backups-zfs-scan.lock"
printf -v EVAL_SAFE_LOCKFILE '%q' "$LOCKFILE"
if dotlockfile -r 0 -l -p "${LOCKFILE}"; then
  logit "Lock obtained at ${LOCKFILE} with dotlockfile"
  trap 'ECODE=$?; dotlockfile -u '"${EVAL_SAFE_LOCKFILE}"'; exit $ECODE' EXIT INT TERM
else
  logit "Could not obtain lock at $LOCKFILE; $0 likely already running."
  exit 0
fi

EXITCODE=0


cd "$INCOMINGDIR"
logit "Scanning queue directory..."
for HOST in *; do
    HOSTPATH="$INCOMINGDIR/$HOST"
    # files like bakfsfmt2-134.13134_dest
    for FILE in "$HOSTPATH"/bakfsfmt2-[0-9]*_?*; do
        if [ ! -f "$FILE" ]; then
            logit "Skipping non-existent $FILE"
            continue
        fi

        # Now, $DEST will be HOST/DEST.  Strip off the @ also.
        DEST="`echo "$FILE" | sed -e 's/^.*bakfsfmt2[^_]*_//' -e 's,@,/,g'`"

        if [ -z "$DEST" ]; then
            logerror "Malformed dest in $FILE"
            continue
        fi
        HOST2="`echo "$DEST" | sed 's,/.*,,g'`"
        if [ -z "$HOST2" ]; then
            logerror "Malformed DEST $DEST in $FILE"
            continue
        fi

        if [ ! "$HOST" = "$HOST2" ]; then
            logerror "$FILE: $HOST doesn't match $HOST2"
            continue
        fi

        logit "Processing $FILE to $STORE/$DEST"
        if runcommand gpg -q -d < "$FILE" | runcommand zstdcat | runcommand zfs receive -u -o readonly=on "$STORE/$DEST"; then
            logit "Successfully processed $FILE to $STORE/$DEST"
            runcommand rm "$FILE"
        else
            logerror "FAILED to process $FILE to $STORE/$DEST"
            EXITCODE=15
        fi
    done
done

exit "$EXITCODE"

Applying These Ideas to Non-ZFS Backups

ZFS backups made our job easier in a lot of ways:

  • ZFS can calculate a diff based on an efficiently-stored previous local state (snapshot or bookmark), rather than a comparison to a remote state (rsync)
  • ZFS "incremental" sends, while less efficient than rsync, are reasonably efficient, sending only changed blocks
  • zfs receive verifies that the incremental source on the local machine matches the incremental source of the original stream, which enforces correct ordering
  • Datasets using ZFS encryption can be sent in their encrypted state
  • Incrementals can be done without a full scan of the filesystem

Some of these benefits you just won't get without ZFS (or something similar like btrfs), but let's see how we could apply these ideas to non-ZFS backups. I will explore the implementation of them in a future post.

When I say "non-ZFS", I am being a bit vague as to whether the source, the destination, or both systems are running a non-ZFS filesystem. In general, I'll assume that neither is ZFS.

The first and most obvious answer is to just tar up the whole system and send that every day. This is, of course, only suitable for small datasets on a fast network. These tarballs could be unpacked on the destination and stored more efficiently via any number of methods (hardlink trees, a block-level deduplicator like borg or rdedup, or even just simply compressed tarballs).

To make the network trip more efficient, something like rdiff or xdelta could be used. A signature file could be stored on the machine being backed up (generated via tee/pee at stream time), and the next run could simply send an rdiff delta over NNCP. This would be quite network-efficient, but still would require reading every byte of every file on every backup, and would also require quite a bit of temporary space on the receiving end (to apply the delta to the previous tarball and generate a new one).
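A very rough sketch of that flow, assuming made-up paths and filenames (and glossing over refreshing the stored signature after each run):

# First run: queue a full tarball, keeping an rdiff signature of it locally.
tar -cf - /home \
  | tee >(rdiff signature > /var/lib/bak/home.sig) \
  | su nncp -c "/usr/local/nncp/bin/nncp-file -nice B -noprogress - 'backupsvr:home-full.tar'"

# Later runs: compute a delta against the stored signature and queue only the delta.
tar -cf - /home \
  | rdiff delta /var/lib/bak/home.sig \
  | su nncp -c "/usr/local/nncp/bin/nncp-file -nice B -noprogress - 'backupsvr:home-$(date +%s).rdiff'"

# On the backup server: apply the delta to the previous tarball to produce the new one.
rdiff patch home-full.tar < home-1234567890.rdiff > home-new.tar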

Alternatively, a program that generates incremental backup files such as rdup could be used. These could be transmitted over NNCP to the backup server, and unpacked there. While perhaps less efficient on the network -- every file with at least one modified byte would be retransmitted in its entirety -- it avoids the need to read every byte of unmodified files or to have enormous temporary space. I should note here that GNU tar claims to have an incremental mode, but it has a potential data loss bug.

There are also some tools with algorithms that may apply well in this use case: syrep and fssync being the two most prominent examples, though rdedup (mentioned above) and the nascent asuran project may also be combinable with other tools to achieve this effect.

I should, of course, conclude this section by mentioning btrfs. Every time I've tried it, I've run into serious bugs, and its status page indicates that only some of them have been resolved. I would not consider using it for something as important as backups. However, if you are comfortable with it, it is likely to be able to run in more constrained environments than ZFS and could probably be processed in much the same way as zfs streams.

Airgapped / Asynchronous Backups with ZFS over NNCP

In my previous articles in the series on asynchronous communication with the modern NNCP tool, I talked about its use for asynchronous, potentially airgapped, backups. The first article, How & Why To Use Airgapped Backups, laid out the foundations for this. Now let’s dig into the details.

Today’s post will cover ZFS, because it has a lot of features that make it very easy to support in this setup. Non-ZFS backups will be covered later.

The setup is actually about as simple as it is for SSH, but since people are less familiar with this kind of communication, I’m going to try to go into more detail here.

Assumptions

I am assuming a setup where:

  • The machines being backed up run ZFS
  • The disk(s) that hold the backups are also running ZFS
  • zfs send / receive is desired as an efficient way to transport the backups
  • The machine that holds the backups may have no network connection whatsoever
  • Backups will be sent encrypted over some sort of network to a spooling machine, which temporarily holds them until they are transported to the destination backup system and ingested there. This system will be unable to decrypt the data streams it temporarily stores.

Hardware

Let’s start with hardware for the machine to hold the backups. I initially considered a Raspberry Pi 4 with 8GB of RAM. That would probably have been a suitable machine, at least for smaller backup sets. However, none of the Raspberry Pi machines support hardware AES encryption acceleration, and my Pi4 benchmarks as about 60MB/s for AES encryption. I want my backups to be encrypted, and decided this would just be too slow for my purposes. Again, if you don’t need encrypted backups or don’t care that much about performance — many people probably fall into this category — you can have a fully-functional Raspberry Pi 4 system for under $100 that would make a fantastic backup server.

I wound up purchasing a Qotom-Q355G4 micro PC with a Core i5 for about $315. It has USB 3 ports and is designed as a rugged, long-lasting system. I have been using one of their older Celeron-based models as my router/firewall for a number of years now and it’s been quite reliable.

For backup storage, you can get a USB 3 external drive. My own preference is to get a USB 3 “toaster” (device that lets me plug in SATA drives) so that I have more control over the underlying medium and can save the expense and hassle of a bunch of power supplies. In a future post, I will discuss drive rotation so you always have an offline drive.

Then, there is the question of transport to the backup machine. A simple solution would be to have a heavily-firewalled backup system that has no incoming ports open but makes occasional outgoing connections to one specific NNCP daemon on the spooling machine. However, for airgapped operation, it would also be very simple to use nncp-xfer to transport the data across on a USB stick or some such. You could set up automounting for a specific USB stick – plug it in, all the spooled data is moved over, then plug it in to the backup system and it’s processed, and any outbound email traffic or whatever is copied to the USB stick at that point too. The NNCP page has some more commentary about this kind of setup.
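A sketch of that USB-stick workflow (the mount point and node name are placeholders):

# On the spooler: copy any packets queued for the backup machine onto the stick.
su nncp -c "/usr/local/nncp/bin/nncp-xfer -tx -mkdir -node backupsvr /mnt/usbstick"

# On the backup machine: pull packets off the stick, then process them.
su nncp -c "/usr/local/nncp/bin/nncp-xfer -rx /mnt/usbstick"
su nncp -c "/usr/local/nncp/bin/nncp-toss -noprogress"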

Both are fairly easy to set up, and NNCP is designed to be transport-agnostic, so in this article I’m going to focus on how to integrate ZFS with NNCP.

Operating System

Of course, it should be no surprise that I set this up on Debian.

As an added step, I did all the configuration in Ansible stored in a local git repo. This adds a lot of work, but it means that it is trivial to periodically wipe and reinstall if any security issue is suspected. The git repo can be copied off to another system for storage and takes the system from freshly-installed to ready-to-use state.

Security

There is, of course, nothing preventing you from running NNCP as root. The zfs commands, obviously, need to be run as root. However, from a privilege separation standpoint, I have chosen to run everything relating to NNCP as a nncp user. NNCP already does encryption, but if you prefer to have zero knowledge of the data even to NNCP, it’s trivial to add gpg to the pipeline as well, and in fact I’ll be demonstrating that in a future post for other reasons.

Software

Besides NNCP, there needs to be a system that generates the zfs send streams. For this project, I looked at quite a few. Most were designed to inspect the list of snapshots on a remote end, compare it to a list on the local end, and calculate a difference from there. This, of course, won’t work for this situation.

I realized my own simplesnap project was very close to being able to do this. It already used an algorithm of using specially-named snapshots on the machine being backed up, so never needed any communication about what snapshots were present where. All it needed was a few more options to permit sending to a stream instead of zfs receive. I made those changes and they are available in simplesnap 2.0.0 or above. That version has also been uploaded to sid, and will work fine as-is on buster as well.

Preparing NNCP

I’m going to assume three hosts in this setup:

  • laptop is the machine being backed up. Of course, you may have quite a few of these.
  • spooler holds the backup data until the backup system picks it up
  • backupsvr holds the backups

The basic NNCP workflow documentation covers the basic steps. You’ll need to run nncp-cfgnew on each machine. This generates a basic configuration, along with public and private keys for that machine. You’ll copy the public key sets to the configurations of the other machines as usual. On the laptop, you’ll add a via line like this:

backupsvr: {
  id: ....
  exchpub: ...
  signpub: ...
  noisepub: ...
  via: ["spooler"]
}

This tells NNCP that data destined for backupsvr should always be sent via spooler first.

You can then arrange for the nncp-daemon to run on the spooler, and nncp-caller or nncp-call on the backupsvr. Or, alternatively, airgapped between the two with nncp-xfer.

Generating Backup Data

Now, on the laptop, install simplesnap (2.0.0 or above). Although you won’t be backing up to the local system, simplesnap still maintains a hostlock in ZFS. Prepare a dataset for it:

zfs create tank/simplesnap
zfs set org.complete.simplesnap:exclude=on tank/simplesnap

Then, create a script /usr/local/bin/runsimplesnap like this:

#!/bin/bash

set -e

simplesnap --store tank/simplesnap --setname backups --local --host `hostname` \
   --receivecmd /usr/local/bin/simplesnap-queue \
   --noreap

su nncp -c '/usr/local/nncp/bin/nncp-toss -noprogress -quiet'

if ip addr | grep -q 192.168.65.64; then
  su nncp -c '/usr/local/nncp/bin/nncp-call -noprogress -quiet -onlinedeadline 1 spooler'
fi

The call to simplesnap sets it up to send the data to simplesnap-queue, which we’ll create in a moment. The --receivecmd option, plus --noreap, sets it up to run without ZFS on the local system.

The call to nncp-toss will process any previously-received inbound NNCP packets, if there are any. Then, in this example, we do a very basic check to see if we’re on the LAN (checking 192.168.65.64), and if so, will establish a connection to the spooler to transmit the data. Of course, you could also do this over the Internet, with tor, or whatever, but in my case, I don’t want to automatically do this in case I’m tethered to mobile. I figure if I want to send backups in that case, I can fire up nncp-call myself. You can also use nncp-caller to set up automated connections on other schedules; there are a lot of options.

Now, here’s what /usr/local/bin/simplesnap-queue looks like:

#!/bin/bash

set -e
set -o pipefail

DEST="`echo $1 | sed 's,^tank/simplesnap/,,'`"

echo "Processing $DEST" >&2
# stdin piped to this
su nncp -c "/usr/local/nncp/bin/nncp-exec -nice B -noprogress backupsvr zfsreceive '$DEST'" >&2
echo "Queued for $DEST" >&2

This is a pretty simple script. simplesnap will call it with a path based on the –store, with the hostname after; so, for instance, tank/simplesnap/laptop/root or some such. This script strips off the leading tank/simplesnap (which is a local fragment), leaving the host and dataset paths. Then it just pipes it to nncp-exec. -nice B classifies it as low-priority bulk data (so if you have some more important interactive data, it would be sent first), then passes it to whatever the backupsvr defines as zfsreceive.

Receiving ZFS backups

In the NNCP configuration on the recipient’s side, in the laptop section, we define what command it’s allowed to run as zfsreceive:

      exec: {
        zfsreceive: ["/usr/bin/sudo", "-H", "/usr/local/bin/nncp-zfs-receive"]
      }

We authorize the nncp user to run this under sudo in /etc/sudoers.d/local-nncp:

Defaults env_keep += "NNCP_SENDER"
nncp ALL=(root) NOPASSWD: /usr/local/bin/nncp-zfs-receive

The NNCP_SENDER is the public key ID of the sending node when nncp-toss processes the incoming data. We can use that for sanity checking later.

Now, here’s a basic nncp-zfs-receive script:

#!/bin/bash
set -e
set -o pipefail

STORE=backups/simplesnap
DEST="$1"

# now process stdin
zfs receive -o readonly=on -x mountpoint "$STORE/$DEST"

And there you have it — all the basics are in place.

Update 2020-12-30: An earlier version of this article had “zfs receive -F” instead of “zfs receive -o readonly=on -x mountpoint”. These changed arguments are more robust.
Update 2021-01-04: I am now recommending “zfs receive -u -o readonly=on”; see my successor article for more.

Enhancements

You could enhance the nncp-zfs-receive script to improve logging and error handling. For instance:

#!/bin/bash

set -e
set -o pipefail

STORE=backups/simplesnap
# $1 will be the host/dataset

DEST="$1"
HOST="`echo "$1" | sed 's,/.*,,g'`"
if [ -z "$HOST" ]; then
   echo "Malformed command line"
   exit 5
fi

# Log a message
logit () {
   logger -p info -t "`basename "$0"`[$$]" "$1"
}

# Log an error message
logerror () {
   logger -p err -t "`basename "$0"`[$$]" "$1"
}

# Log stdin with the given code.  Used normally to log stderr.
logstdin () {
   logger -p info -t "`basename "$0"`[$$/$1]"
}

# Run command, logging stderr and exit code
runcommand () {
   logit "Running $*"
   if "$@" 2> >(logstdin "$1") ; then
      logit "$1 exited successfully"
      return 0
   else
       RETVAL="$?"
       logerror "$1 exited with error $RETVAL"
       return "$RETVAL"
   fi
}
exiterror () {
   logerror "$1"
   echo "$1" 1>&2
   exit 10
}

# Sanity check

if [ "$HOST" = "laptop" ]; then
  if [ "$NNCP_SENDER" != "12345678" ]; then
    exiterror "Host $HOST doesn't match sender $NNCP_SENDER"
  fi
else
  exiterror "Unknown host $HOST"
fi

runcommand zfs receive -u -o readonly=on "$STORE/$DEST"

Now you’ll capture the ZFS receive output in syslog in a friendly way, so you can look back later why things failed if they did.

Further notes on NNCP

nncp-toss will examine the exit code from an invocation. If it is nonzero, it will keep the command (and associated stdin) in the queue and retry it on the next invocation. NNCP does not guarantee order of execution, so it is possible in some cases that ZFS streams may be received in the wrong order. That is fine here; zfs receive will exit with an error, and nncp-toss will just run it again after the dependent snapshots have been received. For non-ZFS backups, a simple sequence number can handle this issue.
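For non-ZFS backups, a sequence-number check on the receiving side might look something like this rough sketch (the filename convention, paths, and the process_backup step are all made up for illustration; sequence numbers start at 1):

#!/bin/bash
set -e
STATEFILE=/var/local/backup-seq/host1
INCOMING=/var/local/nncp-backups-incoming/host1
LAST=$(cat "$STATEFILE" 2>/dev/null || echo 0)

# Assumes zero-padded names like seq-000042.tar so the glob sorts numerically.
for FILE in "$INCOMING"/seq-*; do
    [ -f "$FILE" ] || continue
    SEQ=$(basename "$FILE" | sed -e 's/^seq-0*//' -e 's/\..*$//')
    if [ "$SEQ" -ne $((LAST + 1)) ]; then
        # A gap: stop and wait for a later run, once the missing packet arrives.
        break
    fi
    process_backup "$FILE"     # hypothetical: unpack/apply this increment
    echo "$SEQ" > "$STATEFILE"
    LAST=$SEQ
    rm "$FILE"
done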

How & Why To Use Airgapped Backups

A good backup strategy needs to consider various threats to the integrity of data. For instance:

  • Building catches fire
  • Accidental deletion
  • Equipment failure
  • Security incident / malware / compromise

It’s that last one that is of particular interest today. A lot of backup strategies are such that if a user (or administrator) has their local account or network compromised, their backups could very well be destroyed as well. For instance, do you ssh from the account being backed up to the system holding the backups? Or rsync using a keypair stored on it? Or access S3 buckets, etc? It is trivially easy in many of these schemes to totally ruin cloud-based backups, and other schemes fare little better. rsync can be run with --delete (and often is, to prune remotes), S3 buckets can be deleted, etc. And even if you try to lock down an over-network backup to be append-only, there are still vectors for attack (ssh credentials, OpenSSL bugs, etc). In this post, I try to explore how we can protect against these threats and still retain some modern conveniences.

A backup scheme also needs to make a balance between:

  • Cost
  • Security
  • Accessibility
  • Efficiency (of time, bandwidth, storage, etc)

My story so far…

About 20 years ago, I had an Exabyte tape drive, with the amazing capacity of 7GB per tape! Eventually as disk prices fell, I had external disks plugged in to a server, and would periodically rotate them offsite. I’ve also had various combinations of partial or complete offsite copies over the Internet as well. I have around 6TB of data to back up (after compression), a figure that is growing somewhat rapidly as I digitize some old family recordings and videos.

Since I last wrote about backups 5 years ago, my scheme has been largely unchanged; at present I use ZFS for local and to-disk backups and borg for the copies over the Internet.

Let’s take a look at some options that could make this better.

Tape

The original airgapped backup. You back up to a tape, then you take the (fairly cheap) tape out of the drive and put in another one. In cost per GB, tape is probably the cheapest medium out there. But of course it has its drawbacks.

Let’s start with cost. To get a drive that can handle capacities of what I’d be needing, at least LTO-6 (2.5TB per tape) would be needed, if not LTO-7 (6TB). New, these drives cost several thousand dollars, plus they need LVD SCSI or Fibre Channel cards. You’re not going to be hanging one off a Raspberry Pi; these things need a real server with enterprise-style connectivity. If you’re particularly lucky, you might find an LTO-6 drive for as low as $500 on eBay. Then there are tapes. A 10-pack of LTO-6 tapes runs more than $200, and provides a total capacity of 25TB – sufficient for these needs (note that, of course, you need to have at least double the actual space of the data, to account for multiple full backups in a set). A 5-pack of LTO-7 tapes is a little more expensive, while providing more storage.

So all-in, this is going to be — in the best possible scenario — nearly $1000, and possibly a lot more. For a large company with many TB of storage, the initial costs can be defrayed due to the cheaper media, but for a home user, not so much.

Consider that 8TB hard drives can be found for $150 – $200. A pair of them (for redundancy) would run $300-400, and then you have all the other benefits of disk (quicker access, etc.) Plus they can be driven by something as cheap as a Raspberry Pi.

Fancier tape setups involve auto-changers, but then you’re not really airgapped, are you? (If you leave all your tapes in the changer, they can generally be selected and overwritten, barring things like hardware WORM).

As useful as tape is, for this project, it would simply be way more expensive than disk-based options.

Fundamentals of disk-based airgapping

The fundamental thing we need to address with disk-based airgapping is that the machines being backed up have no real-time contact with the backup storage system. This rules out most solutions out there, that want to sync by comparing local state with remote state. If one is willing to throw storage efficiency out the window — maybe practical for very small data sets — one could just send a full backup daily. But in reality, what is more likely needed is a way to store a local proxy for the remote state. Then a “runner” device (a USB stick, disk, etc) could be plugged into the network, filled with queued data, then plugged into the backup system to have the data dequeued and processed.

Some may be tempted to short-circuit this and just plug external disks into a backup system. I’ve done that for a long time. This is, however, a risk, because it makes those disks vulnerable to whatever may be attacking the local system (anything from lightning to ransomware).

ZFS

ZFS is, it should be no surprise, particularly well suited for this. zfs send/receive can send an incremental stream that represents a delta between two checkpoints (snapshots or bookmarks) on a filesystem. It can do this very efficiently, much more so than walking an entire filesystem tree.

Additionally, with the recent addition of ZFS crypto to ZFS on Linux, the replication stream can optionally reflect the encrypted data. Yes, as long as you don’t need to mount them, you can mostly work with ZFS datasets on an encrypted basis, and can directly tell zfs send to just send the encrypted data instead of the decrypted data.
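Concretely, the workflow is something like this (pool, dataset, and path names are placeholders):

# Take a snapshot and write a full, raw (still-encrypted) stream to the runner drive:
zfs snapshot tank/home@2021-06-01
zfs send -w tank/home@2021-06-01 > /runner/home-full-2021-06-01.zfs

# Subsequent runs send only the delta between two checkpoints:
zfs snapshot tank/home@2021-06-02
zfs send -w -i tank/home@2021-06-01 tank/home@2021-06-02 > /runner/home-incr-2021-06-02.zfs

# The older snapshot can be turned into a bookmark, which is nearly free to keep
# around and can still serve as the source of future incrementals:
zfs bookmark tank/home@2021-06-01 tank/home#2021-06-01
zfs destroy tank/home@2021-06-01

# On the backup system, the streams are replayed in order:
zfs receive -u backuppool/home < /runner/home-full-2021-06-01.zfs
zfs receive -u backuppool/home < /runner/home-incr-2021-06-02.zfs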

The downside of ZFS is the resource requirements at the destination, which in terms of RAM are higher than most of the older Raspberry Pi-style devices. Still, one could perhaps just save off zfs send streams and restore them later if need be, but that implies a periodic resend of a full stream, an inefficient operation. Deduplicating software such as borg could be used on those streams (though with less effectiveness if they’re encrypted).

Tar

Perhaps surprisingly, tar in listed incremental mode can solve this problem for non-ZFS users. It will keep a local cache of the state of the filesystem as of the time of the last run of tar, and can generate new tarballs that reflect the changes since the previous run (even deletions). This can achieve a similar result to the ZFS send/receive, though in a much less elegant way.
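A minimal example of that mode (paths are placeholders; the first run with a fresh state file produces a full backup, later runs produce deltas):

tar --create \
    --listed-incremental=/var/lib/bak/home.snar \
    --file=/runner/home-$(date +%F).tar \
    /home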

Bacula / Bareos

Bacula (and its fork Bareos) both have support for a FIFO destination. Theoretically this could be used to queue up data for transfer to the airgapped machine. This support is very poorly documented in both and is rumored to have bitrotted, however.

rdiff and xdelta

rdiff and xdelta can be used as sort of a non-real-time rsync, at least on a per-file basis. Theoretically, one could generate a full backup (with tar, ZFS send, or whatever), take an rdiff signature, and send over the file while keeping the signature. On the next run, another full backup is piped into rdiff, and on the basis of the signature file of the old and the new data, it produces a binary patch that can be queued for the backup target to update its stored copy of the file.

This leaves history preservation as an exercise to be undertaken on the backup target. It may not necessarily be easy and may not be efficient.

rsync batches

rsync can be used to compute a delta between two directory trees and express this as a single-file batch that can be processed by a remote rsync. Unfortunately this implies the sender must always keep an old tree around (barring a solution such as ZFS snapshots) in order to compute the delta, and of course it still implies the need for history processing on the remote.
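A sketch of how that could work with a locally-kept proxy tree (paths are placeholders):

# On the machine being backed up: record the changes versus a local proxy of the
# destination as a batch file, updating the proxy at the same time.
rsync -a --delete --write-batch=/runner/home-$(date +%F).batch /home/ /var/local/dest-proxy/home/

# On the backup system: replay the batch against its real copy.
rsync -a --delete --read-batch=/runner/home-2021-06-01.batch /backups/home/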

Getting the Data There

OK, so you’ve got an airgapped system, some sort of “runner” device for your sneakernet (USB stick, hard drive, etc). Now what?

Obviously you could just copy data onto the runner and move it back off at the backup target. But a tool like NNCP (sort of a modernized UUCP) offers a lot of help in automating the process, returning error reports, etc. NNCP can be used online over TCP, over reliable serial links, over ssh, with offline onion routing via intermediaries or directly, etc.

Imagine having an airgapped machine at a different location you go to frequently (workplace, friend, etc). Before leaving, you put a USB stick in your pocket. When you get there, you pop it in. It’s despooled and processed while you wait, and return emails or whatever are queued up to be sent when you get back home. Not bad, eh?

Future installment…

I’m going to try some of these approaches and report back on my experiences in the next few weeks.

Silent Data Corruption Is Real

Here’s something you never want to see:

ZFS has detected a checksum error:

   eid: 138
 class: checksum
  host: alexandria
  time: 2017-01-29 18:08:10-0600
 vtype: disk

This means there was a data error on the drive. But it’s worse than a typical data error — this is an error that was not detected by the hardware. Unlike most filesystems, ZFS and btrfs write a checksum with every block of data (both data and metadata) written to the drive, and the checksum is verified at read time. Most filesystems don’t do this, because theoretically the hardware should detect all errors. But in practice, it doesn’t always, which can lead to silent data corruption. That’s why I use ZFS wherever I possibly can.

As I looked into this issue, I saw that ZFS repaired about 400KB of data. I thought, “well, that was unlucky” and just ignored it.

Then a week later, it happened again. Pretty soon, I noticed it happened every Sunday, and always to the same drive in my pool. It so happens that the highest I/O load on the machine happens on Sundays, because I have a cron job that runs zpool scrub on Sundays. This operation forces ZFS to read and verify the checksums on every block of data on the drive, and is a nice way to guard against unreadable sectors in rarely-used data.
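For reference, that cron job is as simple as something like this (the pool name is a placeholder):

# /etc/cron.d/zpool-scrub
30 2 * * Sun root /sbin/zpool scrub tank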

I finally swapped out the drive, but to my frustration, the new drive now exhibited the same issue. The SATA protocol does include a CRC32 checksum, so it seemed (to me, at least) that the problem was unlikely to be a cable or chassis issue. I suspected motherboard.

It so happened I had a 9211-8i SAS card. I had purchased it off eBay awhile back when I built the server, but could never get it to see the drives. I wound up not filling it up with as many drives as planned, so the on-board SATA did the trick. Until now.

As I poked at the 9211-8i, noticing that even its configuration utility didn’t see any devices, I finally started wondering if the SAS/SATA breakout cables were a problem. And sure enough – I realized I had a “reverse” cable and needed a “forward” one. $14 later, I had the correct cable and things are working properly now.

One other note: RAM errors can sometimes cause issues like this, but this system uses ECC DRAM and the errors would be unlikely to always manifest themselves on a particular drive.

So over the course of this, had I not been using ZFS, I would have had several megabytes of reads with undetected errors. Thanks to using ZFS, I know my data integrity is still good.

Backing up every few minutes with simplesnap

I’ve written a lot lately about ZFS, and one of its very nice features is the ability to make snapshots that are lightweight, space-efficient, and don’t hurt performance (unlike, say, LVM snapshots).

ZFS also has “zfs send” and “zfs receive” commands that can send the content of the snapshot, or a delta between two snapshots, as a data stream – similar in concept to an amped-up tar file. These can be used to, for instance, very efficiently send backups to another machine. Rather than having to stat() every single file on a filesystem as rsync has to, it sends effectively an intelligent binary delta — which is also intelligent about operations such as renames.

Since my last search for backup tools, I’d been using BackupPC for my personal systems. But since I switched them to ZFS on Linux, I’ve been wanting to try something better.

There are a lot of tools out there to take ZFS snapshots and send them to another machine, and I summarized them on my wiki. I found zfSnap to work well for taking and rotating snapshots, but I didn’t find anything that matched my criteria for sending them across the network. It seemed par for the course for these tools to think nothing of opening up full root access to a machine from others, whereas I would much rather lock it down with command= in authorized_keys.
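For illustration, the command= lockdown on the machine being backed up looks roughly like this (the wrapper path and the key are placeholders; see the simplesnap documentation for the exact setup):

command="/usr/sbin/simplesnapwrap",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... simplesnap-from-backupserver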

So I wrote my own, called simplesnap. As usual, I wrote extensive documentation for it as well, even though it is very simple to set up and use.

So, with BackupPC, a backup of my workstation took almost 8 hours. (Its “incremental” might take as little as 3 hours.) With ZFS snapshots and simplesnap, it takes 25 seconds. 25 seconds!

So right now, instead of backing up once a day, I back up once an hour. There’s no reason I couldn’t back up every 5 minutes, in fact. The data consumes less space, is far faster to manage, and doesn’t require a nightly hours-long cleanup process like BackupPC does — zfs destroy on a snapshot just takes a few seconds.

I use a pair of USB disks for backups, and rotate them to offsite storage periodically. They simply run ZFS atop dm-crypt (for security) and it works quite well even on those slow devices.
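The setup for each drive is roughly this (the device and pool names are placeholders):

# One-time setup of a rotation drive:
cryptsetup luksFormat /dev/sdX
cryptsetup luksOpen /dev/sdX bakdrive
zpool create bakpool /dev/mapper/bakdrive

# Each time the drive comes back from offsite:
cryptsetup luksOpen /dev/sdX bakdrive
zpool import bakpool
# ... backups run ...
zpool export bakpool
cryptsetup luksClose bakdrive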

Although ZFS doesn’t do file-level dedup like BackupPC does, and the lz4 compression I’ve set ZFS to use is less efficient than the gzip-like compression BackupPC uses, still the backups are more space-efficient. I am not quite sure why, but I suspect it’s because there is a lot less metadata to keep track of, and perhaps also because BackupPC has to store a new copy of a file if even a byte changes, whereas ZFS can store just the changed blocks.

Incidentally, I’ve packaged both zfSnap and simplesnap for Debian and both are waiting in NEW.

Why and how to run ZFS on Linux

I’m writing a bit about ZFS these days, and I thought I’d write a bit about why I am using it, why it might or might not be interesting for you, and what you might do about it.

ZFS Features and Background

ZFS is not just a filesystem in the traditional sense, though you can use it that way. It is an integrated storage stack, which can completely replace the need for LVM, md-raid, and even hardware RAID controllers. This permits quite a bit of flexibility and optimization not present when building a stack involving those components. For instance, if a drive in a RAID fails, it needs only rebuild the parts that have actual data stored on them.

Let’s look at some of the features of ZFS:

  • Full checksumming of all data and metadata, providing protection against silent data corruption. The only other Linux filesystem to offer this is btrfs.
  • ZFS is a transactional filesystem that ensures consistent data and metadata.
  • ZFS is copy-on-write, with snapshots that are cheap to create and impose virtually undetectable performance hits. Compare to LVM snapshots, which make writes notoriously slow and require an fsck and mount to get to a readable point.
  • ZFS supports easy rollback to previous snapshots.
  • ZFS send/receive can perform incremental backups much faster than rsync, particularly on systems with many unmodified files. Since it works from snapshots, it guarantees a consistent point-in-time image as well.
  • Snapshots can be turned into writeable “clones”, which simply use copy-on-write semantics. It’s like a cp -r that completes almost instantly and takes no space until you change it.
  • The datasets (“filesystems” or “logical volumes” in LVM terms) in a zpool (“volume group”, to use LVM terms) can shrink or grow dynamically. They can have individual maximum and minimum sizes set, but unlike LVM, where if, say, /usr gets bigger than you thought, you have to manually allocate more space to it, ZFS datasets can use any space available in the pool.
  • ZFS is designed to run well in big iron, and scales to massive amounts of storage. It supports SSDs as L2 cache and ZIL (intent log) devices.
  • ZFS has some built-in compression methods that are quite CPU-efficient and can yield not just space but performance benefits in almost all cases involving compressible data.
  • ZFS pools can host zvols, a block device under /dev that stores its data in the zpool. zvols support TRIM/DISCARD, so are ideal for storing VM images, as they can instantly release space released by the guest OS. They can also be snapshotted and backed up like the rest of ZFS.
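For example, creating a zvol for a VM disk is a one-liner (names are placeholders):

zfs create -V 20G tank/vms/testvm-disk0
# It then appears as a block device the hypervisor can use directly:
ls -l /dev/zvol/tank/vms/testvm-disk0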

Although it is often considered a server filesystem, ZFS has been used in plenty of other situations for some time now, with ports to FreeBSD, Linux, and MacOS. I find it particularly useful:

  • To have faith that my photos, backups, and paperwork archives are intact. zpool scrub at any time will read the entire dataset and verify the integrity of every bit.
  • I can create snapshots of my system before running apt-get dist-upgrade, making it easy to track down issues or roll back to a known-good configuration. Ideal for people tracking sid or testing. One can also easily simply boot from a previous snapshot.
  • Many scripts exist that make frequent snapshots and retain them for a period of time as a way of protecting work in progress against an accidental rm. There is no reason not to snapshot /home every 5 minutes, for instance (see the sketch after this list). It’s almost as good as storing / in git.
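Here is a minimal sketch of that kind of setup (the dataset name and retention are placeholders; in practice you’d use one of the existing snapshotting tools):

# /etc/cron.d/zfs-home-snapshots
*/5 * * * * root /usr/local/bin/snap-home

# /usr/local/bin/snap-home
#!/bin/bash
set -e
zfs snapshot "tank/home@auto-$(date +%Y%m%d-%H%M)"
# Keep roughly a day's worth of 5-minute snapshots (288 of them); prune the rest.
zfs list -H -t snapshot -d 1 -o name -s creation tank/home \
  | grep '@auto-' \
  | head -n -288 \
  | xargs -r -n 1 zfs destroy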

The added level of security in having cheap snapshots available is almost worth it by itself.

ZFS drawbacks

Compared to other Linux filesystems, there are a few drawbacks of ZFS:

  • CDDL will prevent it from ever being part of the Linux kernel tree
  • It is more RAM-hungry than most, although with tuning it can even run on the Raspberry Pi.
  • A 64-bit kernel is strongly preferred, even in low-memory situations.
  • Performance on many small files may be worse than ext4’s
  • The ZFS cache does not shrink and expand in response to changing RAM usage conditions on the system as well as the normal Linux cache does.
  • ZFS lacks some features that btrfs has, such as being able to shrink an existing pool or easily change storage allocation on the fly. On the other hand, ZFS’s features have never caused me a kernel panic, while half the things I liked about btrfs seem to have.
  • ZFS is already quite stable on Linux. However, the GRUB, init, and initramfs code supporting booting from a ZFS root and /boot is less stable. If you want to go 100% ZFS, be prepared to tweak your system to get it to boot properly. Once done, however, it is quite stable.

Converting to ZFS

I have written up an extensive HOWTO on converting an existing system to use ZFS. It covers workarounds for all the boot-time bugs I have encountered as well as documenting all steps needed to make it happen. It works quite well.

Additional Hints

If setting up zvols to be used by VirtualBox or some such system, you might be interested in managing zvol ownership and permissions with udev.

Debian-Live Rescue image with ZFS On Linux; Ditched btrfs

I’m a geek. I enjoy playing with different filesystems, version control systems, and, well, for that matter, radios.

I have lately started to worry about the risks of silent data corruption, and as such, looked to switch my personal systems to either ZFS or btrfs, both of which offer built-in checksumming of all data and metadata. I initially opted for btrfs, because of its tighter integration into the Linux kernel and ability to shrink an existing btrfs filesystem.

However, as I wrote last month, that experiment was not a success. I had too many serious performance regressions and one too many kernel panics and decided it wasn’t worth it. And that the SuSE people got it wrong, deeply wrong, when they declared btrfs ready for production. I never lost any data, to its credit. But it simply reduces uptime too much.

That left ZFS. Before I build a system, I always want to make sure I can repair it. So I started with the Debian Live rescue image, and added the zfsonlinux.org repository to it, along with some key packages to enable the ZFS kernel modules, GRUB support, and initramfs support. The resulting image is described, and can be downloaded from, my ZFS Rescue Disc wiki page, which also has a link to my source tree on github.

In future blog posts in the series, I will describe the process of converting existing Debian installations to use ZFS, of getting them to boot from ZFS, some bugs I encountered along the way, and some surprising performance regressions in ZFS compared to ext4 and btrfs.

Results with btrfs and zfs

The recent news that openSUSE considers btrfs safe for users prompted me to consider using it. And indeed I did. I was already familiar with zfs, so considered this a good opportunity to experiment with btrfs.

btrfs makes an intriguing filesystem for all sorts of workloads. The benefits of btrfs and zfs are well-documented elsewhere. There are a number of features btrfs has that zfs lacks. For instance:

  • The ability to shrink a device that’s a member of a filesystem/pool
  • The ability to remove a device from a filesystem/pool entirely, assuming enough free space exists elsewhere for its data to be moved over.
  • Asynchronous deduplication that imposes neither a synchronous performance hit nor a heavy RAM burden
  • Copy-on-write copies down to the individual file level with cp --reflink
  • Live conversion of data between different profiles (single, dup, RAID0, RAID1, etc)
  • Live conversion between on-the-fly compression methods, including none at all
  • Numerous SSD optimizations, including alignment and both synchronous and asynchronous TRIM options
  • Proper integration with the VM subsystem
  • Proper support across the many Linux architectures, including 32-bit ones (zfs is currently only flagged stable on amd64)
  • Does not require excessive amounts of RAM

The feature set of ZFS that btrfs lacks is well-documented elsewhere, but there are a few odd btrfs missteps:

  • There is no way to see how much space a subvolume/filesystem is using without turning on quotas. Even then, it is cumbersome and not reported with df like it should be.
  • When a maximum size for a subvolume is set via a quota, it is not reported via df; applications have no idea when they are about to hit the maximum size of a filesystem.

btrfs would be fine if it worked reliably. I should say at the outset that I have never lost any data due to it, but it has caused enough kernel panics that I’ve lost count. Several times I had a file that produced a panic when I tried to delete it, several times it took more than 12 hours to unmount a btrfs filesystem, and hardlink-heavy workloads took days longer to complete than on zfs or ext4. And that’s just the ones I wrote about. I tried to use btrfs balance to change the metadata allocation on the filesystem, and never did get it to complete; it seemed to go into an endless I/O pattern after the first 1GB of metadata and never got past that. I didn’t bother trying the live migration of data from one disk to another on this filesystem.

I wanted btrfs to work. I really, really did. But I just can’t see it working. I tried it on my laptop, but had to turn off CoW on my virtual machine’s disk because of the rm bug. I tried it on my backup devices, but it was unusable there due to being so slow. (Also, the hardlink behavior is broken by default and requires btrfstune -r. Yipe.)

At this point, I don’t think it is really all that worth bothering with. I think the SuSE decision is misguided and ill-informed. btrfs will be an awesome filesystem. I am quite sure it will, and will in time probably displace zfs as the most advanced filesystem out there. But that time is not yet here.

In the meantime, I’m going to build a Debian Live Rescue CD with zfsonlinux on it. Because I don’t ever set up a system I can’t repair.

rdiff-backup, ZFS, and rsync scripts

rdiff-backup vs. ZFS

As I’ve been writing about backups, I’ve gone ahead and run some tests with rdiff-backup. I have been using rdiff-backup personally for many years now — probably since 2002, when I packaged it up for Debian. It’s a nice, stable system, but I always like to look at other options for things every so often.

rdiff-backup stores an uncompressed current mirror of the filesystem, similar to rsync. History is achieved by the use of compressed backwards binary deltas generated by rdiff (using the rsync algorithm). So, you can restore the current copy very easily — a simple cp will do if you don’t need to preserve permissions. rdiff-backup restores previous copies by applying all necessary binary deltas to generate the previous version.

Things I like about rdiff-backup:

  1. Bandwidth-efficient
  2. Reasonably space-efficient, especially where history is concerned
  3. Easily scriptable and nice CLI
  4. Unlike tools such as duplicity, there is no need to periodically run full backups — old backups can be deleted without impacting the ability to restore more current backups

Things I don’t like about it:

  1. Speed. It can be really slow. Deleting 3 months’ worth of old history takes hours. It has to unlink vast numbers of files — and that’s pretty much it, but it does it really slowly. Restores, backups, etc. are all slow as well. Even just getting a list of your increment sizes so you’d know how much space would be saved can take a very long time.
  2. The current backup copy is stored without any kind of compression, which is not at all space-efficient
  3. It creates vast numbers of little files that take forever to delete or summarize

So I thought I would examine how efficient ZFS would be. I wrote a script that would replay the rdiff-backup history — first it would rsync the current copy onto the ZFS filesystem and make a ZFS snapshot. Then each previous version was processed by my script (rdiff-backup’s files are sufficiently standard that a shell script can process them), and a ZFS snapshot created after each. This lets me directly compare the space used by rdiff-backup to that used by ZFS using actual history.

I enabled gzip-3 compression and block dedup in ZFS.
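For reference, the settings amount to just this (the dataset name is a placeholder):

zfs set compression=gzip-3 backuppool/rdifftest
zfs set dedup=on backuppool/rdifftest

And the replay loop was conceptually along these lines (this sketch leans on rdiff-backup's restore mode rather than parsing the increment files directly, as my actual script did; every path and date in it is illustrative):

#!/bin/bash
set -e
RDIFF_REPO=/backups/rdiff-backup/myhost
DATASET=backuppool/rdifftest
MNT=/backuppool/rdifftest
WORK=/tmp/replay-restore

# Oldest first; in reality these times would come from rdiff-backup --list-increments.
for WHEN in 2011-01-01 2011-02-01 2011-03-01; do
    rm -rf "$WORK"
    rdiff-backup -r "$WHEN" "$RDIFF_REPO" "$WORK"
    rsync -a --delete --inplace "$WORK/" "$MNT/"
    zfs snapshot "$DATASET@replay-$WHEN"
done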

My backups were nearly 1TB in size and the amount of space I had available for ZFS was roughly 600GB, so I couldn’t test all of them. As it happened, I tested the ones that were the worst-case scenario for ZFS: my photos, music collection, etc. These files had very little duplication and very little compressibility. Plus a backup of my regular server that was reasonably compressible.

The total size of the data backed up with rdiff-backup was 583 GB. With ZFS, this came to 498GB. My dedup ratio on this was only 1.05 (meaning 5% or 25GB saved). The compression ratio was 1.12 (60GB saved). The combined ratio was 1.17 (85GB saved). Interestingly 498 + 85 = 583.

Remember that the data under test here was mostly a worst-case scenario for ZFS. It would probably have done better had I had the time to throw the rest of my dataset at it (such as the 60GB backup of my iPod, which would have mostly deduplicated with the backup of my music server).

One problem with ZFS is that dedup is very memory-hungry. This is common knowledge and it is advertised that you need to use roughly 2GB of RAM per TB of disk when using dedup. I don’t have quite that much to dedicate to it, so ZFS got VERY slow and thrashed the disk a lot after the ARC grew to about 300MB. I found some tweakables in zfsrc and the zfs command that let me tweak the ARC cache to grow bigger. But the machine in question only has 2GB RAM, and is doing lots of other things as well, so this barely improved anything. Note that this dedup RAM requirement is not out of line with what is expected from these sorts of solutions.

Even if I got absolutely stellar dedup ratio of 2:1, that would get me at most 1TB. The cost of buying a 1TB disk is less than the cost of upgrading my system to 4GB RAM, so dedup isn’t worth it here.

I think the lesson is: think carefully about where dedup makes sense. If you’re storing a bunch of nearly-identical virtual machine images — the sort of canonical use case for this — go for it. A general fileserver — well, maybe you should just add more disk instead of more RAM.

Then that raises the question: if I don’t need dedup from ZFS, do I bother with it at all, or just use ext4 and LVM snapshots? I think ZFS still makes sense, given its built-in support for compression and very fast snapshots — LVM snapshots are known to cause serious degradation to write performance once enabled, which ZFS doesn’t.

So I plan to switch my backups to use ZFS. A few observations on this:

  1. Some testing suggests that the time to delete a few months of old snapshots will be a minute or two with ZFS compared to hours with rdiff-backup.
  2. ZFS has shown itself to be more space-efficient than rdiff-backup, even without dedup enabled.
  3. There are clear performance and convenience wins with ZFS.
Backup Scripts

So now comes the question of backup scripts. rsync is obviously a pretty nice choice here — and if used with --inplace perhaps even will play friendly with ZFS snapshots even if dedup is off. But let’s say I’m backing up a few machines at home, or perhaps dozens at work. There is a need to automate all of this. Specifically, there’s a need to:

  1. Provide scheduling, making sure that we don’t hammer the server with 30 clients all at once
  2. Provide for “run before” jobs to do things like snapshot databases
  3. Be silent on success and scream loudly via emails to administrators on any kind of error… and keep backing up other systems when there is an error
  4. Create snapshots and provide an automated way to remove old snapshots (or mount them for reading, as ZFS-fuse doesn’t support the .zfs snapshot directory yet)

To date I haven’t found anything that looks suitable. I found a shell script system called rsbackup that does a large part of this, but something about using a script whose homepage is a forum makes me less than 100% confident.

On the securing the backups front, rsync comes with a good-looking rrsync script (inexplicably installed under /usr/share/doc/rsync/scripts instead of /usr/bin on Debian) that can help secure the SSH authorization. GNU rush also looks like a useful restricted shell.
    On the securing the backups front, rsync comes with a good-looking rrsync script (inexplicably installed under /usr/share/doc/rsync/scripts instead of /usr/bin on Debian) that can help secure the SSH authorization. GNU rush also looks like a useful restricted shell.