Recommendations for Tools for Backing Up and Archiving to Removable Media

May 29, 2023 | Linux, Online Life, Software | archiving, backup, backups, dar, git-annex | John Goerzen

I have several TB worth of family photos, videos, and other data. This needs to be backed up — and archived.

Backups and archives are often thought of as similar. And indeed, they may be done with the same tools at the same time. But the goals differ somewhat:

Backups are designed to recover from a disaster that you can fairly rapidly detect.

Archives are designed to survive for many years, protecting against disaster not only impacting the original equipment but also the original person that created them.

This implies that while a nice ZFS snapshot-based scheme supporting twice-hourly backups may be fantastic for backups, it becomes much less appealing for archives once you consider family members needing to access it if you are incapacitated, or accessibility in a few decades’ time. ZFS doesn’t have the wide software support that NTFS, FAT, UDF, ISO-9660, etc. do.

This post isn’t about the pros and cons of the different storage media, nor is it about the pros and cons of cloud storage for archiving; these conversations can readily be found elsewhere. Let’s assume, for the sake of discussion, that we are considering BD-R optical discs as well as external HDDs, both of which are too small to hold the entire backup set.

What would you use for archiving in these circumstances?

Establishing goals

The goals I have are:

Archives can be restored using Linux or Windows (even though I don’t use Windows, this requirement will ensure the broadest compatibility in the future)
The archival system must be able to accommodate periodic updates consisting of new files, deleted files, moved files, and modified files, without requiring a rewrite of the entire archive dataset
Archives can ideally be mounted on any common OS and the component files directly copied off
Redundancy must be possible. In the worst case, one could manually copy one drive/disc to another. Ideally, the archiving system would automatically track making n copies of data.
While a full restore may be a goal, simply finding one file or one directory may also be a goal. Ideally, an archiving system would be able to quickly tell me which discs/drives contain a given file.
Ideally, preserves as much POSIX metadata as possible (hard links, symlinks, modification date, permissions, etc). However, for the archiving case, this is less important than for the backup case, with the possible exception of modification date.
Must be easy enough to do, and sufficiently automatable, to allow frequent updates without error-prone or time-consuming manual hassle

I would welcome your ideas for what to use. Below, I’ll highlight different approaches I’ve looked into and how they stack up.

Basic copies of directories

The initial approach might be one of simply copying directories across. This would work well if the data set to be archived is smaller than the archival media. In that case, you could just burn or rsync a new copy with every update and be done. Unfortunately, this is much less convenient with data of the size I’m dealing with. A single rsync run is not an option in that case, because no one device can hold the whole set. With some datasets, you could manually design some rsyncs to store individual directories on individual devices, but that gets unwieldy fast and isn’t scalable.

You could use something like my datapacker program to split the data across multiple discs/drives efficiently. However, updates will be a problem; you’d have to re-burn the entire set to get a consistent copy, or rely on external tools like mtree to reflect deletions. Not very convenient in any case.

So I won’t be using this.

tar or zip

While you can split tar and zip files across multiple media, they have a lot of issues. GNU tar’s incremental mode is clunky and buggy; zip is even worse. tar files can’t be read randomly, making it extremely time-consuming to extract just certain files out of a tar file.

The only thing going for these formats (and especially zip) is the wide compatibility for restoration.

dar

Here we start to get into the more interesting tools. Dar is, in my opinion, one of the best Linux tools that few people know about. Since I first wrote about dar in 2008, it’s added some interesting new features; among them, binary deltas and cloud storage support. So, dar has quite a few interesting features that I make use of in other ways, and could also be quite helpful here:

Dar can both read and write files sequentially (streaming, like tar), or with random-access (quick seek to extract a subset without having to read the entire archive)
Dar can apply compression to individual files, rather than to the archive as a whole, facilitating both random access and resilience (corruption in one file doesn’t invalidate all subsequent files). Dar also supports numerous compression algorithms including gzip, bzip2, xz, lzo, etc., and can omit compressing already-compressed files.
The end of each dar file contains a central directory (dar calls this a catalog). The catalog contains everything necessary to extract individual files from the archive quickly, as well as everything necessary to make a future incremental archive based on this one. Additionally, dar can make and work with “isolated catalogs” — a file containing the catalog only, without data.
Dar can split the archive into multiple pieces called slices. This can best be done with fixed-size slices (--slice and --first-slice options), which let the catalog record the slice number and preserve random access capabilities. With the --execute option, dar can easily wait for a given slice to be burned, etc.
Dar normally stores an entire new copy of a modified file, but can optionally store an rdiff binary delta instead. This has the potential to be far smaller (think of a case of modifying metadata for a photo, for instance).
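As a rough sketch of how those features map onto the command line, the shell functions below wrap a full sliced archive, an isolated catalog, an incremental archive based on that catalog, and a single-file restore. The function names, the choice of xz, and the file masks left uncompressed are illustrative assumptions rather than a tested recipe; check the option letters against the dar man page for your version.

```shell
# dar_full BASENAME ROOT SLICE_SIZE
# Full archive of ROOT in fixed-size slices, compressing each file with xz
# but leaving already-compressed photo/video files alone (masks are examples).
dar_full() {
    dar -c "$1" -R "$2" -s "$3" -zxz -Z '*.jpg' -Z '*.mp4'
}

# dar_isolate CATALOG ARCHIVE
# Copy only the catalog of ARCHIVE into its own small file.
dar_isolate() {
    dar -C "$1" -A "$2"
}

# dar_incremental BASENAME ROOT SLICE_SIZE REFERENCE
# Archive only what changed since REFERENCE (an archive or isolated catalog).
dar_incremental() {
    dar -c "$1" -R "$2" -s "$3" -zxz -A "$4"
}

# dar_extract_one ARCHIVE DEST PATH
# Restore a single file or directory PATH from ARCHIVE into DEST.
dar_extract_one() {
    dar -x "$1" -R "$2" -g "$3"
}
```

A hypothetical run might be `dar_full /stage/photos-full /srv/photos 23G` followed by `dar_isolate /stage/photos-full-cat /stage/photos-full`, with later runs passing the isolated catalog to `dar_incremental` so the original discs need not be mounted.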

Additionally, dar comes with a dar_manager program. dar_manager makes a database out of dar catalogs (or archives). This can then be used to identify the precise archive containing a particular version of a particular file.

All this combines to make a useful system for archiving. Isolated catalogs are tiny, and it would be easy enough to include the isolated catalogs for the entire set of archives that came before (or even the dar_manager database file) with each new incremental archive. This would make restoration of a particular subset easy.
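A minimal sketch of that dar_manager workflow follows. The wrapper names and any database path passed to them are hypothetical; the flags are the ones dar_manager documents for creating a database, adding an archive, and looking up a file, but verify them against your installed version.

```shell
# dm_create DB: start an empty dar_manager database file.
dm_create() {
    dar_manager -C "$1"
}

# dm_add DB ARCHIVE: register an archive (or an isolated catalog) by basename.
dm_add() {
    dar_manager -B "$1" -A "$2"
}

# dm_where DB PATH: show which registered archives hold versions of PATH.
dm_where() {
    dar_manager -B "$1" -f "$2"
}
```

Because isolated catalogs are small, adding one per archive run keeps the database current without needing the data discs at hand.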

The main thing to address with dar is that you do need dar to extract the archive. Every dar release comes with source code and a win64 build. dar also supports building a statically-linked Linux binary. It would therefore be easy to include a win64 binary, a Linux binary, and the source with every archive run. dar is also a part of multiple Linux and BSD distributions, which are archived around the Internet. I think this provides reasonable future-proofing to make sure dar archives will still be readable in the future.

The other challenge is user ability. While dar is highly portable, it is fundamentally a CLI tool and will require CLI abilities on the part of users. I suspect, though, that I could write up a few pages of instructions to include and make that a reasonably easy process. Not everyone can use a CLI, but I would expect a person that could follow those instructions could be readily-enough found.

One other benefit of dar is that it could easily be used with tapes. The LTO series is liked by various hobbyists, though it could pose formidable obstacles to non-hobbyists trying to access data in future decades. Additionally, since the archive is a big file, it lends itself to working with par2 to provide redundancy for certain amounts of data corruption.
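The par2 idea can be sketched roughly as below, assuming the par2cmdline tool and dar’s usual BASENAME.N.dar slice naming. The function names and the 10% default redundancy are assumptions chosen for illustration, not a recommendation.

```shell
# protect_slices BASENAME [PERCENT]
# Create recovery data covering every slice of the archive BASENAME.
# PERCENT is the redundancy level and defaults to 10 (an arbitrary example).
protect_slices() {
    par2 create "-r${2:-10}" "$1.par2" "$1".*.dar
}

# verify_slices BASENAME: check the slices against the recovery data.
verify_slices() {
    par2 verify "$1.par2"
}

# repair_slices BASENAME: attempt to rebuild damaged slices.
repair_slices() {
    par2 repair "$1.par2"
}
```

The recovery files would need to be stored alongside the slices, or on a separate disc, for this to help during a restore.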

git-annex

git-annex is an interesting program that is designed to facilitate managing large sets of data and moving it between repositories. git-annex has particular support for offline archive drives and tracks which drives contain which files.

The idea would be to store the data to be archived in a git-annex repository. Then git-annex commands could generate filesystem trees on the external drives (or trees to be burned to read-only media).

In a post about using git-annex for blu-ray backups, an earlier thread about DVD-Rs was mentioned.

This has a few interesting properties. For one, with due care, the files can be stored on archival media as regular files. There are some different options for how to generate the archives; some of them would place the entire git-annex metadata on each drive/disc. With that arrangement, one could access the individual files without git-annex. With git-annex, one could reconstruct the final (or any intermediate) state of the archive appropriately, handling deletions, renames, etc. You would also easily be able to know where copies of your files are.

The practice is somewhat more challenging. Hundreds of thousands of files — what I would consider a medium-sized archive — can pose some challenges, running into hours-long execution if used in conjunction with the directory special remote (but only minutes-long with a standard git-annex repo).

Ruling out the directory special remote, I had thought I could maybe just work with my files in git-annex directly. However, I ran into some challenges with that approach as well. I am uncomfortable with git-annex mucking about with hard links in my source data. While it does try to preserve timestamps in the source data, these are lost on the clones. I wrote up my best effort to work around all this.

In a forum post, the author of git-annex comments that “I don’t think that CDs/DVDs are a particularly good fit for git-annex, but it seems a couple of users have gotten something working.” The page he references is Managing a large number of files archived on many pieces of read-only medium. Some of that discussion is a bit dated (for instance, the directory special remote has the importtree feature that implements what was being asked for there), but has some interesting tips.

git-annex supplies win64 binaries, and git-annex is included with many distributions as well. So it should be nearly as accessible as dar in the future. Since git-annex would be required to restore a consistent recovery image, similar caveats as with dar apply; CLI experience would be needed, along with some written instructions.

Bacula and BareOS

Although primarily tape-based archivers, these do also nominally support drives and optical media. However, they are much more tailored as backup tools, especially with the ability to pull from multiple machines. They require a database and extensive configuration, making them a poor fit for both the creation and future extractability of this project.

Conclusions

I’m going to spend some more time with dar and git-annex, testing them out, and hope to write some future posts about my experiences.

15 thoughts on “Recommendations for Tools for Backing Up and Archiving to Removable Media”

M.J. says: @ social.treehouse.systems

May 29, 2023 at 5:07 pm

@jgoerzen I’ve recently been through this cleaning out old data and building a new archive. I went with git-annex, now I’m trying to figure out how to get my data out of it. cp -L seems to be my saving grace to follow symbolic links that are everywhere. Some things I’m doing for my next iteration:
- Compartmentalizing to a Toughbook dedicated to organizing my archive
- Using M-DISCs for cold data instead of Blu-ray (due to rapid bit rot)
- Using NVMe to USB external drives for hot/warm data
- … TBD: hashes and par2 or some sort of recovery mechanism in case of bit rot.

John Goerzen says: @ floss.social

May 29, 2023 at 5:09 pm

@mj git-annex’s unlocked mode (combined with thin) may help you there. There’s a lot of conversation on r/DataHoarders about M-DISC and whether or not they’re still worth it in the BD-R era, since it seems like BD-Rs already use a chemistry that is much more like what M-DISC was doing for DVDs. IOW, they may not add much now. Still, I may try them out, at least for one copy. I think I may burn at least 2 copies of each disc, using different brands of media.

M.J. says: @ social.treehouse.systems

May 29, 2023 at 5:13 pm

@jgoerzen Interesting, I’ll have to check it out, I’ve been trying to figure out my next archive kit, thanks. I expect they’ll only live about a decade or so in my archive before I replace them with something new, that’s the biggest thing I learned after making my last backup station was portability and frequent upgrades. Thanks for the git-annex tip, I was sure there was something easy I was missing.

Saša says:

May 29, 2023 at 12:54 pm

Hello,

I simply like restic…

Terry Hancock says: @ realsocial.life

May 29, 2023 at 6:28 pm

@jgoerzen If using optical media for long term archives (>10 years), you should acquire “M-Disc” BD-R media to use. They don’t fade like dye-based media. I use them for my annual archives. As for the data, I stage archives on a scratch drive and then burn them to disk with K3B or Brasero. As you point out, archives are a whole different mindset from regular backups. And thanks for the info about Dar. I should look into that!

Mantas says:

May 29, 2023 at 10:56 pm

We use Restic to back up Windows fileservers.

John Goerzen says:
  
  May 30, 2023 at 4:07 pm
  
  I don’t think that Restic will work well with optical discs that are smaller than the backup set.
  
Pingback: Links 30/05/2023: LibreOffice 7.6 in Review and More Digital Restrictions (DRM) From HP | Techrights
anarcat says: @ kolektiva.social

May 30, 2023 at 2:27 pm

@jgoerzen one piece of advice: don’t use git-annex’s encryption systems. it’s clunky, confusing, and really error-prone. it’s nice to have encryption at rest, but if i would really need this in a next project, i’d use something like tomb AKA just LUKS… or a borg backend.

anarcat says: @ kolektiva.social

May 30, 2023 at 2:28 pm

@jgoerzen also, in my experience the main problem with git-annex (on top of usability, which can be nightmarish if you don’t know git well enough) is that it can’t handle a large number of files reasonably enough. things become slow quick, so i have multiple git-annex repositories and syncing those is a problem on its own.

John Goerzen says: @ changelog.complete.org

June 16, 2023 at 3:09 pm

In my recent post about data archiving to removable media, I laid out the difference between backing up and archiving, and also said I’d evaluate git-annex and dar. This post evaluates git-annex. The next will look at dar, and then I’ll make a comparison post.
What is git-annex?
git-annex is a fantastic and versatile program that does… well, it’s one of those things that can do so much that it’s a bit hard to describe. Its homepage says:

git-annex allows managing large files with git, without storing the file contents in git. It can sync, backup, and archive your data, offline and online. Checksums and encryption keep your data safe and secure. Bring the power and distributed nature of git to bear on your large files with git-annex.

I think the particularly interesting features of git-annex aren’t actually included in that list. Among the features of git-annex that make it shine for this purpose, its location tracking is key. git-annex can know exactly which device has which file at which version at all times. Combined with its preferred content settings, this lets you very easily say things like:

“I want exactly 1 copy of every file to exist within the set #1 of backup drives. Here’s a drive in that set; copy to it whatever needs to be copied to satisfy that requirement.”
“Now I have another set of backup drives. Periodically I will swap sets offsite. Copy whatever is needed to this drive in the second set, making sure that there is 1 copy of every file within this set as well, regardless of what’s in the first set.”
“Here’s a directory I want to use to track the status of everything else. I don’t want any copies at all here.”

git-annex can be set to allow a configurable amount of free space to remain on a device, and it will fill it up with whatever copies are necessary up until it hits that limit. Very convenient!
git-annex will store files in a folder structure that mirrors the origin folder structure, in plain files just as they were. This maximizes the ability for a future person to access the content, since it is all viewable without any special tool at all. Of course, for things like optical media, git-annex will essentially be creating what amounts to incrementals. To obtain a consistent copy of the original tree, you would still need to use git-annex to process (export) the archives.
git-annex challenges
In my prior post, I related some challenges with git-annex. The biggest of them – quite poor performance of the directory special remote when dealing with many files – has been resolved by Joey, git-annex’s author! That dramatically improves the git-annex use scenario here! The fixing commit is in the source tree but not yet in a release.
git-annex no doubt may still have performance challenges with repositories in the 100,000+ file range, but in that order of magnitude it now looks usable. I’m not sure about 1,000,000-file repositories (I haven’t tested); there is a page about scalability.
A few other more minor challenges remain:

git-annex doesn’t really preserve POSIX attributes; for instance, permissions, symlink destinations, and timestamps are all not preserved. Of these, timestamps are the most important for my particular use case.
If your data set to archive contains Git repositories itself, these will not be included.

I worked around the timestamp issue by using the mtree-netbsd package in Debian. mtree writes out a summary of files and metadata in a tree, and can restore them. To save:
mtree -c -R nlink,uid,gid,mode -p /PATH/TO/REPO -X <(echo './.git') > /tmp/spec
And, after restoration, the timestamps can be applied with:
mtree -t -U -e < /tmp/spec
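Where mtree is not available, the same idea can be approximated with GNU find and touch. This is a swapped-in sketch rather than the mtree approach described above: it covers only modification times of regular files, skips the .git directory, assumes GNU findutils and coreutils, and does not handle paths containing newlines. The function names save_mtimes and restore_mtimes are made up for illustration.

```shell
# save_mtimes TREE SPECFILE
# Record "epoch<TAB>relative/path" for every regular file under TREE,
# pruning the .git directory so repository internals are not listed.
save_mtimes() {
    ( cd "$1" && find . -path ./.git -prune -o -type f -printf '%T@\t%p\n' ) > "$2"
}

# restore_mtimes TREE SPECFILE
# Reapply the recorded times to files under TREE; files that no longer
# exist are silently skipped.
restore_mtimes() {
    tab=$(printf '\t')
    while IFS="$tab" read -r stamp path; do
        if [ -e "$1/$path" ]; then
            touch -d "@$stamp" "$1/$path"
        fi
    done < "$2"
}
```

For example, `save_mtimes /PATH/TO/REPO /tmp/mtimes` before archiving and `restore_mtimes /PATH/TO/RESTORE /tmp/mtimes` after a restore, with the spec file itself stored in the archive.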
Walkthrough: initial setup
To use git-annex in this way, we have to do some setup. My general approach is this:

There is a source of data that lives outside git-annex. I’ll call this $SOURCEDIR.
I’m going to name the directories holding my data $REPONAME.
There will be a “coordination” git-annex repo. It will hold metadata only, and no data. This will let us track where things live. I’ll call it $METAREPO.
There will be drives. For this example, I’ll call their mountpoints $DRIVE01 and $DRIVE02. For easy demonstration purposes, I used a ZFS dataset with a refquota set (to observe the size handling), but I could have as easily used a LVM volume, btrfs dataset, loopback filesystem, or USB drive. For optical discs, this would be a staging area or a UDF filesystem.

Let’s get started! I’ve set all these shell variables appropriately for this example, and REPONAME to “testdata”. We’ll begin by setting up the metadata-only tracking repo.
$ REPONAME=testdata
$ mkdir "$METAREPO"
$ cd "$METAREPO"
$ git init
$ git config annex.thin true
There is a sort of complicated topic of how git-annex stores files in a repo, which varies depending on whether the data for the file is present in a given repo, and whether the file is locked or unlocked. Basically, the options I use here cause git-annex to mostly use hard links instead of symlinks or pointer files, for maximum compatibility with non-POSIX filesystems such as NTFS and UDF, which might be used on these devices. thin is part of that.
Let’s continue:
$ git annex init 'local hub'
init local hub ok
(recording state in git...)
$ git annex wanted . "include=* and exclude=$REPONAME/*"
wanted . ok
(recording state in git...)
In a bit, we are going to import the source data under the directory named $REPONAME (here, testdata). The wanted command says: in this repository (represented by the bare dot), the files we want are matched by the rule that says everything except what’s under $REPONAME. In other words, we don’t want to make an unnecessary copy here.
Because I expect to use an mtree file as documented above, and it is not under $REPONAME/, it will be included. Let’s just add it and tweak some things.
$ touch mtree
$ git annex add mtree
add mtree ok
(recording state in git...)
$ git annex sync
git-annex sync will change default behavior to operate on --content in a future version of git-annex. Recommend you explicitly use --no-content (or -g) to prepare for that change. (Or you can configure annex.synccontent)
commit
[main (root-commit) 6044742] git-annex in local hub
 1 file changed, 1 insertion(+)
 create mode 120000 mtree
ok
$ ls -l
total 9
lrwxrwxrwx 1 jgoerzen jgoerzen 178 Jun 15 22:31 mtree -> .git/annex/objects/pX/ZJ/...
OK! We’ve added a file, and it got transformed into a symlink. That’s the thing I said we were going to avoid, so:
$ git annex adjust --unlock-present
adjust
Switched to branch 'adjusted/main(unlockpresent)'
ok
$ ls -l
total 1
-rw-r--r-- 2 jgoerzen jgoerzen 0 Jun 15 22:31 mtree
You’ll notice it transformed into a hard link (nlinks=2) file. Great! Now let’s import the source data. For that, we’ll use the directory special remote.
$ git annex initremote source type=directory directory=$SOURCEDIR importtree=yes encryption=none
initremote source ok
(recording state in git...)
$ git annex enableremote source directory=$SOURCEDIR
enableremote source ok
(recording state in git...)
$ git config remote.source.annex-readonly true
$ git config annex.securehashesonly true
$ git config annex.genmetadata true
$ git config annex.diskreserve 100M
$ git config remote.source.annex-tracking-branch main:$REPONAME
OK, so here we created a new remote named “source”. We enabled it, and set some configuration. Most notably, that last line causes files from “source” to be imported under $REPONAME/ as we wanted earlier. Now we’re ready to scan the source.
$ git annex sync
At this point, you’ll see git-annex computing a hash for every file in the source directory.
I can verify with du that my metadata-only repo only uses 14MB of disk space, while my source is around 4GB.
Now we can see what git-annex thinks about file locations:
$ git-annex whereis | less
whereis mtree (1 copy)
  8aed01c5-da30-46c0-8357-1e8a94f67ed6 -- local hub [here]
ok
whereis testdata/[redacted] (0 copies)
  The following untrusted locations may also have copies:
  9e48387e-b096-400a-8555-a3caf5b70a64 -- [source]
failed
... many more lines ...
So remember we said we wanted mtree, but nothing under testdata, under this repo? That’s exactly what we got. git-annex knows that the files under testdata can be found under the “source” special remote, but aren’t in any git-annex repo — yet. Now we’ll start adding them.
Walkthrough: removable drives
I’ve set up two 500MB filesystems to represent removable drives. We’ll see how git-annex works with them.
$ cd $DRIVE01
$ df -h .
Filesystem                     Size  Used Avail Use% Mounted on
acrypt/no-backup/annexdrive01  500M  1.0M  499M   1% /acrypt/no-backup/annexdrive01
$ git clone $METAREPO
Cloning into 'testdata'...
done.
$ cd $REPONAME
$ git config annex.thin true
$ git annex init "test drive #1"
$ git annex adjust --hide-missing --unlock
adjust
Switched to branch 'adjusted/main(hidemissing-unlocked)'
ok
$ git annex sync
OK, that’s the initial setup. Now let’s enable the source remote and configure it the same way we did before:
$ git annex enableremote source directory=$SOURCEDIR
enableremote source ok
(recording state in git...)
$ git config remote.source.annex-readonly true
$ git config remote.source.annex-tracking-branch main:$REPONAME
$ git config annex.securehashesonly true
$ git config annex.genmetadata true
$ git config annex.diskreserve 100M
Now, we’ll add the drive to a group called “driveset01” and configure what we want on it:
$ git annex group . driveset01
$ git annex wanted . '(not copies=driveset01:1)'
What this does is say: first of all, this drive is in a group named driveset01. Then, this drive wants any files for which there isn’t already at least one copy in driveset01.
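Since every drive gets the same treatment, the per-drive setup shown so far could be collected into one function along these lines. This is only a sketch: setup_drive is a hypothetical name, it assumes METAREPO, SOURCEDIR, and REPONAME are already set in the environment, and unlike the walkthrough it passes an explicit target directory to git clone.

```shell
# setup_drive MOUNTPOINT LABEL GROUP
# Clone the coordination repo onto a drive, attach the read-only source
# remote, and ask for files that have no copy yet within GROUP.
# Runs in a subshell so the caller's working directory is left alone.
setup_drive() (
    set -e
    cd "$1"
    git clone "$METAREPO" "$REPONAME"
    cd "$REPONAME"
    git config annex.thin true
    git annex init "$2"
    git annex adjust --hide-missing --unlock
    git annex sync
    git annex enableremote source "directory=$SOURCEDIR"
    git config remote.source.annex-readonly true
    git config remote.source.annex-tracking-branch "main:$REPONAME"
    git config annex.securehashesonly true
    git config annex.genmetadata true
    git config annex.diskreserve 100M
    git annex group . "$3"
    git annex wanted . "(not copies=$3:1)"
)
```

A second drive in the same set would then be something like `setup_drive "$DRIVE02" "test drive #2" driveset01`, while a drive in an offsite set would pass a different group name.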
Now let’s load up some files!
$ git annex sync --content
As the messages fly by from here, you’ll see it mentioning that it got mtree, and then various files from “source” — until, that is, the filesystem had less than 100MB free, at which point it complained of no space for the rest. Exactly like we wanted!
Now, we need to teach $METAREPO about $DRIVE01.
$ cd $METAREPO
$ git remote add drive01 $DRIVE01/$REPONAME
$ git annex sync drive01
git-annex sync will change default behavior to operate on --content in a future version of git-annex. Recommend you explicitly use --no-content (or -g) to prepare for that change. (Or you can configure annex.synccontent)
commit
On branch adjusted/main(unlockpresent)
nothing to commit, working tree clean
ok
merge synced/main (Merging into main...)
Updating d1d9e53..817befc
Fast-forward
(Merging into adjusted branch...)
Updating 7ccc20b..861aa60
Fast-forward
ok
pull drive01
remote: Enumerating objects: 214, done.
remote: Counting objects: 100% (214/214), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 110 (delta 6), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (110/110), 13.01 KiB | 1.44 MiB/s, done.
Resolving deltas: 100% (6/6), completed with 6 local objects.
From /acrypt/no-backup/annexdrive01/testdata
 * [new branch]      adjusted/main(hidemissing-unlocked) -> drive01/adjusted/main(hidemissing-unlocked)
 * [new branch]      adjusted/main(unlockpresent) -> drive01/adjusted/main(unlockpresent)
 * [new branch]      git-annex -> drive01/git-annex
 * [new branch]      main -> drive01/main
 * [new branch]      synced/main -> drive01/synced/main
ok
OK! This step is important, because drive01 and drive02 (which we’ll set up shortly) won’t necessarily be able to reach each other directly, due to not being plugged in simultaneously. Our $METAREPO, however, will know all about where every file is, so that the “wanted” settings can be correctly resolved. Let’s see what things look like now:
$ git annex whereis | less
whereis mtree (2 copies)
  8aed01c5-da30-46c0-8357-1e8a94f67ed6 -- local hub [here]
  b46fc85c-c68e-4093-a66e-19dc99a7d5e7 -- test drive #1 [drive01]
ok
whereis testdata/[redacted] (1 copy)
  b46fc85c-c68e-4093-a66e-19dc99a7d5e7 -- test drive #1 [drive01]
  The following untrusted locations may also have copies:
  9e48387e-b096-400a-8555-a3caf5b70a64 -- [source]
ok

If I scroll down a bit, I’ll see the files past the 400MB mark that didn’t make it onto drive01. Let’s add another example drive!
Walkthrough: Adding a second drive
The steps for $DRIVE02 are the same as we did before, just with drive02 instead of drive01, so I’ll omit listing it all a second time. Now look at this excerpt from whereis:
whereis testdata/[redacted] (1 copy)
  b46fc85c-c68e-4093-a66e-19dc99a7d5e7 -- test drive #1 [drive01]
  The following untrusted locations may also have copies:
  9e48387e-b096-400a-8555-a3caf5b70a64 -- [source]
ok
whereis testdata/[redacted] (1 copy)
  c4540343-e3b5-4148-af46-3f612adda506 -- test drive #2 [drive02]
  The following untrusted locations may also have copies:
  9e48387e-b096-400a-8555-a3caf5b70a64 -- [source]
ok

Look at that! Some files on drive01, some on drive02, some neither place. Perfect!
Walkthrough: Updates
So I’ve made some changes in the source directory: moved a file, added another, and deleted one. All of these were copied to drive01 above. How do we handle this?
First, we update the metadata repo:
$ cd $METAREPO
$ git annex sync
$ git annex dropunused all
OK, this has scanned $SOURCEDIR and noted changes. Let’s see what whereis says:
$ git annex whereis | less
...
whereis testdata/cp (0 copies)
  The following untrusted locations may also have copies:
  9e48387e-b096-400a-8555-a3caf5b70a64 -- [source]
failed
whereis testdata/file01-unchanged (1 copy)
  b46fc85c-c68e-4093-a66e-19dc99a7d5e7 -- test drive #1 [drive01]
  The following untrusted locations may also have copies:
  9e48387e-b096-400a-8555-a3caf5b70a64 -- [source]
ok

So this looks right. The file I added was a copy of /bin/cp. I moved another file to one named file01-unchanged. Notice that it realized this was a rename and that the data still exists on drive01.
Well, let’s update drive01.
$ cd $DRIVE01/$REPONAME
$ git annex sync --content
Looking at the testdata/ directory now, I see that file01-unchanged has been renamed, the deleted file is gone, but cp isn’t yet here — probably due to space issues; as it’s new, it’s undefined whether it or some other file would fill up free space. Let’s work along a few more commands.
$ git annex get --auto
$ git annex drop --auto
$ git annex dropunused all
And now, let’s make sure metarepo is updated with its state.
$ cd $METAREPO
$ git annex sync
We could do the same for drive02. This is how we would proceed with every update.
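Because this sequence repeats for every drive on every update, it lends itself to a small wrapper. The sketch below simply strings together the commands from this walkthrough in the same order; update_drive is a hypothetical name, METAREPO and REPONAME are assumed to be set, and the drive is assumed to be mounted and already registered as a remote of the metadata repo.

```shell
# update_drive MOUNTPOINT
# 1. Rescan the source from the metadata repo.
# 2. Bring the drive in line with its preferred content.
# 3. Record the drive's new state back in the metadata repo.
# Runs in a subshell so the caller's working directory is left alone.
update_drive() (
    set -e
    cd "$METAREPO"
    git annex sync
    git annex dropunused all

    cd "$1/$REPONAME"
    git annex sync --content
    git annex get --auto
    git annex drop --auto
    git annex dropunused all

    cd "$METAREPO"
    git annex sync
)
```

Calling `update_drive "$DRIVE01"` and then `update_drive "$DRIVE02"` as each drive is plugged in would cover a full update cycle.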
Walkthrough: Restoration
Now, we have bare files at reasonable locations in drive01 and drive02. But, to generate a consistent restore, we need to be able to actually do an export. Otherwise, we may have files with old names, duplicate files, etc. Let’s assume that we lost our source and metadata repos and have to restore from scratch. We’ll make a new $RESTOREDIR. We’ll begin with drive01 since we used it most recently.
$ mv $METAREPO $METAREPO.disabled
$ mv $SOURCEDIR $SOURCEDIR.disabled
$ git clone $DRIVE01/$REPONAME $RESTOREDIR
$ cd $RESTOREDIR
$ git config annex.thin true
$ git annex init "restore"
$ git annex adjust --hide-missing --unlock
Now, we need to connect the drive01 and pull the files from it.
$ git remote add drive01 $DRIVE01/$REPONAME
$ git annex sync --content
Now, repeat with drive02:
$ git remote add drive02 $DRIVE02/$REPONAME
$ git annex sync --content
Now we’ve got all our content back! Here’s what whereis looks like:
whereis testdata/file01-unchanged (3 copies)
  3d663d0f-1a69-4943-8eb1-f4fe22dc4349 -- restore [here]
  9e48387e-b096-400a-8555-a3caf5b70a64 -- source
  b46fc85c-c68e-4093-a66e-19dc99a7d5e7 -- test drive #1 [origin]
ok
...
I was a little surprised that drive01 didn’t seem to know what was on drive02. Perhaps that could have been remedied by adding more remotes there? I’m not entirely sure; I’d thought git-annex would have been able to do that automatically.
Conclusions
I think I have demonstrated two things:
First, git-annex is indeed an extremely powerful tool. I have only scratched the surface here. The location tracking is a neat feature, and being able to just access the data as plain files if all else fails is nice for future users.
Secondly, it is also a complex tool and difficult to get right for this purpose (I think much easier for some other purposes). For someone that doesn’t live and breathe git-annex, it can be hard to get right. In fact, I’m not entirely sure I got it right here. Why didn’t drive02 know what files were on drive01 and vice-versa? I don’t know, and that reflects some kind of misunderstanding on my part about how metadata is synced; perhaps more care needs to be taken in restore, or done in a different order, than I proposed. I initially tried to do a restore by using git annex export to a directory special remote with exporttree=yes, but I couldn’t ever get it to actually do anything, and I don’t know why.
These two cut against each other. On the one hand, the raw accessibility of the data to someone with no computer skills is unmatched. On the other hand, I’m not certain I have the skill to always prepare the discs properly, or to do a proper consistent restore.

John Goerzen says: @ changelog.complete.org

June 17, 2023 at 1:16 am

This is the third post in a series about data archiving to removable media (optical discs and hard drives). In the first, I explained the difference between backing up and archiving, established goals for the project, and said I’d evaluate git-annex and dar. The second post evaluated git-annex, and now it’s time to look at dar. The series will conclude with a post comparing git-annex with dar.
What is dar?
I could open with the same thing I did with git-annex, just changing the name of the program: “[dar] is a fantastic and versatile program that does… well, it’s one of those things that can do so much that it’s a bit hard to describe.” It is, fundamentally, an archiver like tar or zip (makes one file representing a bunch of other files), but it goes far beyond that. dar’s homepage lays out a comprehensive list of features, which I will try to summarize here.

Dar is both a library (with C++ and Python bindings) for interacting with data, and a CLI tool (the dar command).
Alongside this, there is an ecosystem of tools around dar, including GUIs for multiple platforms, backup scripts, and FUSE implementations.
Dar is like tar in that it can read and write files sequentially if desired. Dar archives can be streamed, just like tar archives. But dar takes it further; if you have dar_slave on the remote end, random access is possible over ssh (dramatically speeding up certain operations).
Dar is like zip in that a dar archive contains a central directory (called a catalog) which permits random access to the contents of an archive. In other words, you don’t have to read an entire archive to extract just one file (assuming the archive is on disk or something that itself permits random access). Also, dar can compress each file individually, rather than the tar approach of compressing the archive as a whole. This increases archive performance (dar knows not to try to compress already-compressed data), boosts restore resilience (corruption of one part of an archive doesn’t invalidate the entire rest of it), and boosts restore performance (permitting random access).
Dar can split an archive into multiple pieces called slices, and it can even split member files among the slices. The catalog contains information allowing you to know which slice(s) a given file is saved in.
The catalog can also be saved off in a file of its own (dar calls this an “isolated catalog”). Isolated catalogs record just metadata about files archived.
dar_manager can assemble a database by reading archives or isolated catalogs, letting you know where files are stored and facilitating restores using the minimal number of discs.
Dar supports differential/incremental backups, which record changes since the last backup. These backups record not just additions, but also deletions. dar can optionally use rsync-style binary deltas to minimize the space needed to record changes. Dar does not suffer from GNU tar’s data loss bug with incrementals.
Dar can “slice and dice” archives like Perl does strings. The usage notes page shows how you can merge archives, create decremental archives (where the full backup always reflects the current state of the system, and incrementals go backwards in time instead of forwards), etc. You can change the compression algorithm on an existing archive, re-slice it, etc.
Dar is extremely careful about preserving all metadata: hard links, sparse files, symlinks, timestamps (including subsecond resolution), EAs, POSIX ACLs, resource forks on Mac, detecting files being modified while being read, etc. It makes a nice way to copy directories, sort of similar to rsync -avxHAXS.

So to tie this together for this project, I will set up a 400MB slice size (to mimic what I did with git-annex), and see how dar saves the data and restores it.
Isolated catalogs aren’t strictly necessary for this, but by using them (and/or dar_manager), we can build up a database of files and locations and thus directly compare dar to git-annex location tracking.
Walkthrough: Creating the first archive
As with the git-annex walkthrough, I’ll set some variables to make it easy to remember:

$SOURCEDIR is the directory being backed up
$DRIVE is the directory for backups to be stored in. Since dar can split by a specified size, I don’t need to make separate filesystems to simulate the separate drive experience as I did with git-annex.
$CATDIR will hold isolated catalogs
$DARDB points to the dar_manager database

OK, we can run the backup immediately. No special setup is needed. dar supports both short-form (single-character) parameters and long-form ones. Since the parameters probably aren’t familiar to everyone, I will use the long-form ones in these examples.
Here’s how we create our initial full backup. I’ll explain the parameters below:
$ dar --verbose --create $DRIVE/bak1 --on-fly-isolate $CATDIR/bak1 --slice 400M --min-digits 2 --pause --fs-root $SOURCEDIR
Let’s look at each of these parameters:

--verbose does what you expect
--create selects the operation mode (like tar -c) and gives the archive basename
--on-fly-isolate says to write an isolated catalog as well, right while making the archive. You can always create an isolated catalog later (which is fast, since it only needs to read the last bits of the last slice) but it’s more convenient to do it now, so we do. We give the base name for the isolated catalog also.
--slice 400M says to split the archive, and create slices 400MB each.
--min-digits 2 pertains to naming files. Without it, dar would create files named bak1.1.dar, bak1.2.dar, bak1.10.dar, etc. dar works fine with this, but it can be annoying in ls. This is just convenience for humans.
--pause tells dar to pause after writing each slice. This would let us swap drives, burn discs, etc. I do this for demonstration purposes only; it isn’t strictly necessary in this situation. For a more powerful option, dar also supports --execute, which can run commands after each slice.
--fs-root gives the path to actually back up.

This same command could have been written with short options as:
$ dar -v -c $DRIVE/bak1 -@ $CATDIR/bak1 -s 400M -9 2 -p -R $SOURCEDIR
What does it look like while running? Here’s an excerpt:
...
Adding file to archive: /acrypt/no-backup/jgoerzen/testdata/[redacted]
Finished writing to file 1, ready to continue ? [return = YES | Esc = NO]
...
Writing down archive contents...
Closing the escape layer...
Writing down the first archive terminator...
Writing down archive trailer...
Writing down the second archive terminator...
Closing archive low layer...
Archive is closed.
--------------------------------------------
581 inode(s) saved
including 0 hard link(s) treated
0 inode(s) changed at the moment of the backup and could not be saved properly
0 byte(s) have been wasted in the archive to resave changing files
0 inode(s) with only metadata changed
0 inode(s) not saved (no inode/file change)
0 inode(s) failed to be saved (filesystem error)
0 inode(s) ignored (excluded by filters)
0 inode(s) recorded as deleted from reference backup
--------------------------------------------
Total number of inode(s) considered: 581
--------------------------------------------
EA saved for 0 inode(s)
FSA saved for 581 inode(s)
--------------------------------------------
Making room in memory (releasing memory used by archive of reference)...
Now performing on-fly isolation...
...

That was easy! Let’s look at the contents of the backup directory:
$ ls -lh $DRIVE
total 3.7G
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.01.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.02.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.03.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.04.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.05.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.06.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.07.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.08.dar
-rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:29 bak1.09.dar
-rw-r--r-- 1 jgoerzen jgoerzen 156M Jun 16 19:33 bak1.10.dar
And the isolated catalog:
$ ls -lh $CATDIR
total 37K
-rw-r--r-- 1 jgoerzen jgoerzen 35K Jun 16 19:33 bak1.1.dar
The isolated catalog is stored compressed automatically.
Well this was easy. With one command, we archived the entire data set, split into 400MB chunks, and wrote out the catalog data.
Walkthrough: Inspecting the saved archive
Can dar tell us which slice contains a given file? Sure:
$ dar --list $DRIVE/bak1 --list-format=slicing | less
Slice(s)|[Data ][D][ EA ][FSA][Compr][S]|Permission| Filemane
--------+--------------------------------+----------+-----------------------------
...
1 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- [redacted]
1-2 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- [redacted]
2 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- [redacted]
...
This illustrates the transition from slice 1 to slice 2. The first file was stored entirely in slice 1; the second was stored partially in slice 1 and partially in slice 2, and the third solely in slice 2. We can get other kinds of information as well.
$ dar --list $DRIVE/bak1 | less
[Data ][D][ EA ][FSA][Compr][S]| Permission | User | Group | Size | Date | filename
--------------------------------+------------+-------+-------+---------+-------------------------------+------------
[Saved][ ] [-L-][ 0%][X] -rwxr--r-- jgoerzen jgoerzen 24 Mio Mon Mar 5 07:58:09 2018 [redacted]
[Saved][ ] [-L-][ 0%][X] -rwxr--r-- jgoerzen jgoerzen 16 Mio Mon Mar 5 07:58:09 2018 [redacted]
[Saved][ ] [-L-][ 0%][X] -rwxr--r-- jgoerzen jgoerzen 22 Mio Mon Mar 5 07:58:09 2018 [redacted]
These are the same files I was looking at before. Here we see they are 24MB, 16MB, and 22MB in size, and some additional metadata. Even more is available in the XML list format.
Walkthrough: updates
As with git-annex, I’ve made some changes in the source directory: moved a file, added another, and deleted one. Let’s create an incremental backup now:
$ dar --verbose --create $DRIVE/bak2 --on-fly-isolate $CATDIR/bak2 --ref $CATDIR/bak1 --slice 400M --min-digits 2 --pause --fs-root $SOURCEDIR
This command is very similar to the earlier one. Instead of writing an archive and catalog named bak1, we write one named bak2. What’s new here is --ref $CATDIR/bak1. That says, make an incremental based on an archive of reference. All that is needed from that archive of reference is the detached catalog. --ref $DRIVE/bak1 would have worked equally well here.
Here’s what I did to the $SOURCEDIR:

Renamed a file to file01-unchanged
Deleted a file
Copied /bin/cp to a file named cp

Let’s see if dar’s command output matches this:
...
Adding file to archive: /acrypt/no-backup/jgoerzen/testdata/file01-unchanged
Saving Filesystem Specific Attributes for /acrypt/no-backup/jgoerzen/testdata/file01-unchanged
Adding file to archive: /acrypt/no-backup/jgoerzen/testdata/cp
Saving Filesystem Specific Attributes for /acrypt/no-backup/jgoerzen/testdata/cp
Adding folder to archive: [redacted]
Saving Filesystem Specific Attributes for [redacted]
Adding reference to files that have been destroyed since reference backup...
...
--------------------------------------------
3 inode(s) saved
including 0 hard link(s) treated
0 inode(s) changed at the moment of the backup and could not be saved properly
0 byte(s) have been wasted in the archive to resave changing files
0 inode(s) with only metadata changed
578 inode(s) not saved (no inode/file change)
0 inode(s) failed to be saved (filesystem error)
0 inode(s) ignored (excluded by filters)
2 inode(s) recorded as deleted from reference backup
--------------------------------------------
Total number of inode(s) considered: 583
--------------------------------------------
EA saved for 0 inode(s)
FSA saved for 3 inode(s)
--------------------------------------------
...
Yes, it does. The rename is recorded as a deletion and an addition, since dar doesn’t directly track renames. So the rename plus the deletion account for the two deletions. The rename plus the addition of cp count as 2 of the 3 inodes saved; the third is the modified directory from which files were deleted and moved out.
Let’s see the files that were created:
$ ls -lh $DRIVE/bak2*
-rw-r--r-- 1 jgoerzen jgoerzen 18M Jun 16 19:52 /acrypt/no-backup/jgoerzen/dar-testing/drive/bak2.01.dar
$ ls -lh $CATDIR/bak2*
-rw-r--r-- 1 jgoerzen jgoerzen 22K Jun 16 19:52 /acrypt/no-backup/jgoerzen/dar-testing/cat/bak2.1.dar
What does --list look like now?
Slice(s)|[Data ][D][ EA ][FSA][Compr][S]|Permission| Filemane
--------+--------------------------------+----------+-----------------------------
[ ][ ] [---][-----][X] -rwxr--r-- [redacted]
1 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- file01-unchanged
...
[--- REMOVED ENTRY ----][redacted]
[--- REMOVED ENTRY ----][redacted]
Here I show an example of:

A file that was not changed from the initial backup. Its presence was simply noted, but because we’re doing an incremental, the data wasn’t saved.
A file that is saved in this incremental, on slice 1.
The two deleted files

Walkthrough: dar_manager
As we’ve seen above, the two archives (or their detached catalog) give us a complete picture of what files were present at the time of the creation of each archive, and what files were stored in a given archive. We can certainly continue working in that way. We can also use dar_manager to build a comprehensive database of these archives, to be able to find what media is necessary to restore each given file. Or, with dar_manager’s --when parameter, we can restore files as of a particular date.
Let’s try it out. First, we create our database:
$ dar_manager --create $DARDB
$ dar_manager --base $DARDB --add $DRIVE/bak1
Auto detecting min-digits to be 2
$ dar_manager --base $DARDB --add $DRIVE/bak2
Auto detecting min-digits to be 2
Here we created the database, and added our two catalogs to it. (Again, we could have as easily used $CATDIR/bak1; either the archive or its isolated catalog will work here.) It’s important to add the catalogs in order.
Let’s do some quick experimentation with dar_manager:
$ dar_manager -v --base $DARDB --list
Decompressing and loading database to memory...
dar path :
dar options :
database version : 6
compression used : gzip
compression level: 9
archive # | path | basename
------------+--------------+---------------
1 /acrypt/no-backup/jgoerzen/dar-testing/drive bak1
2 /acrypt/no-backup/jgoerzen/dar-testing/drive bak2
$ dar_manager --base $DARDB --stat
archive # | most recent/total data | most recent/total EA
--------------+-------------------------+-----------------------
1 580/581 0/0
2 3/3 0/0

The --list option shows the correlation between dar_manager archive number (1, 2) with filenames (bak1, bak2). It is coincidence here that 1/bak1 and 2/bak2 correlate; that’s not necessarily the case. Most dar_manager commands operate on archive number, while dar commands operate on archive path/basename.
Now let’s see just what files are saved in archive #2, the incremental:
$ dar_manager --base $DARDB --used 2
[ Saved ][ ] [redacted]
[ Saved ][ ] file01-unchanged
[ Saved ][ ] cp
Now we can also see where a file is stored. Here’s one that was saved in the full backup and unmodified in the incremental:
$ dar_manager --base $DARDB --file [redacted]
1 Fri Jun 16 19:15:12 2023 saved absent
2 Fri Jun 16 19:15:12 2023 present absent
(The absent at the end refers to extended attributes that the file didn’t have)
Similarly, for files that were added or removed, they’ll be listed only at the appropriate place.
Walkthrough: Restoration
I’m not going to repeat the author’s full restoration with dar page, but here are some quick examples.
A simple way of doing everything is using incrementals for the whole series. To do that, you’d have bak1 be full, bak2 based on bak1, bak3 based on bak2, bak4 based on bak3, etc. To restore from such a series, you have two options:

Use dar to simply extract each archive in order. It will handle deletions, renames, etc. along the way.
Use dar_manager with the backup database to manage the process. It may be somewhat more efficient, as it won’t bother to restore files that will later be modified or deleted.

If you get fancy — for instance, bak2 is based on bak1, bak3 on bak2, bak4 on bak1 — then you would want to use dar_manager to ensure a consistent restore is completed. Either way, the process is nearly identical. Also, I figure, to make things easy, you can save a copy of the entire set of isolated catalogs before you finalize each disc/drive. They’re so small, and this would let someone with just the most recent disc build a dar_manager database without having to go through all the other discs.
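The catalog-copying idea above could be scripted roughly as follows. This is a minimal sketch and not part of the walkthrough: the function names, the catalogs subdirectory, and the assumption that catalog basenames sort chronologically (for example, date-based names) are all mine. Only dar_manager --create and --base ... --add come from the commands shown earlier, and the isolated catalogs are assumed to be single files ending in .1.dar as in the listing above.

```sh
#!/bin/sh
# Sketch only: helper names and the "catalogs" subdirectory are hypothetical.

# Copy every isolated catalog into the disc/drive staging directory,
# so that the newest medium always carries the full set.
copy_catalogs() {
    catdir=$1
    drive=$2
    mkdir -p "$drive/catalogs"
    for f in "$catdir"/*.1.dar; do
        [ -e "$f" ] || continue        # no catalogs yet: nothing to copy
        cp -p "$f" "$drive/catalogs/"
    done
}

# Build a fresh dar_manager database from a directory of isolated catalogs.
# Assumes shell glob order (alphabetical) matches the order the backups
# were made in, since the catalogs need to be added in order.
rebuild_db() {
    catdir=$1
    db=$2
    dar_manager --create "$db" || return 1
    for f in "$catdir"/*.1.dar; do
        [ -e "$f" ] || continue
        # dar_manager wants the basename, i.e. the path without ".1.dar"
        dar_manager --base "$db" --add "${f%.1.dar}" || return 1
    done
}
```

One would call copy_catalogs "$CATDIR" "$DRIVE" before finalizing a disc, and later rebuild_db against the catalogs directory on the most recent disc. A database built this way records where the catalogs live rather than where the archives live, so the paths would probably need adjusting before a restore.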
Anyhow, let’s do a restore using just dar. I’ll make a $RESTOREDIR and do it that way.
$ dar --verbose --extract $DRIVE/bak1 --fs-root $RESTOREDIR --no-warn --execute "echo Ready for slice %n. Press Enter; read foo"
This --execute lets us see how dar works, and illustrates the power it has beyond --pause. It’s a snippet interpreted by /bin/sh, with %n being one of the dar placeholders. If memory serves, it’s not strictly necessary, as dar will prompt you for slices it needs if they’re not mounted. Anyhow, you’ll see it first reading the last slice, which contains the catalog, then reading from the beginning.
Here we go:
Auto detecting min-digits to be 2
Opening archive bak1 ...
Opening the archive using the multi-slice abstraction layer...
Ready for slice 10. Press Enter
...
Loading catalogue into memory...
Locating archive contents...
Reading archive contents...
File ownership will not be restored du to the lack of privilege, you can disable this message by asking not to restore file ownership [return = YES | Esc = NO]
Continuing...
Restoring file's data: [redacted]
Restoring file's FSA: [redacted]
Ready for slice 1. Press Enter
...
Ready for slice 2. Press Enter
...
--------------------------------------------
581 inode(s) restored
including 0 hard link(s)
0 inode(s) not restored (not saved in archive)
0 inode(s) not restored (overwriting policy decision)
0 inode(s) ignored (excluded by filters)
0 inode(s) failed to restore (filesystem error)
0 inode(s) deleted
--------------------------------------------
Total number of inode(s) considered: 581
--------------------------------------------
EA restored for 0 inode(s)
FSA restored for 0 inode(s)
--------------------------------------------
The warning is because I’m not doing the extraction as root, which limits dar’s ability to fully restore ownership data.
OK, now the incremental:
$ dar --verbose --extract $DRIVE/bak2 --fs-root $RESTOREDIR --no-warn --execute "echo Ready for slice %n. Press Enter; read foo"
...
Ready for slice 1. Press Enter
...
Restoring file's data: /acrypt/no-backup/jgoerzen/dar-testing/restore/file01-unchanged
Restoring file's FSA: /acrypt/no-backup/jgoerzen/dar-testing/restore/file01-unchanged
Restoring file's data: /acrypt/no-backup/jgoerzen/dar-testing/restore/cp
Restoring file's FSA: /acrypt/no-backup/jgoerzen/dar-testing/restore/cp
Restoring file's data: /acrypt/no-backup/jgoerzen/dar-testing/restore/[redacted directory]
Removing file (reason is file recorded as removed in archive): [redacted file]
Removing file (reason is file recorded as removed in archive): [redacted file]
This all looks right! Now how about we compare the restore to the original source directory?
$ diff -durN $SOURCEDIR $RESTOREDIR
No changes – perfect.
We could instead do this restore via a single dar_manager command, though annoyingly, we’d have to pass all top-level files/directories to dar_manager --restore. But still, it’s one command, and basically automates and optimizes the dar restores shown above.
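The --execute hook used in the dar restores above can also call a small script instead of an inline echo. Below is a minimal sketch of such a script. It assumes dar's %p (slice directory), %b (basename), %N (zero-padded slice number) and %e (extension) placeholders, which should be checked against the dar man page for your version. The function names and the script name are hypothetical.

```sh
#!/bin/sh
# Sketch only: function names are hypothetical.
# Intended to be invoked by dar as, for example:
#   --execute "/path/to/wait-slice.sh %p %b %N %e"

# Build the file name dar is about to read: <dir>/<basename>.<number>.<ext>
slice_path() {
    printf '%s/%s.%s.%s\n' "$1" "$2" "$3" "$4"
}

# Keep prompting until the requested slice is actually present,
# e.g. after the right disc has been inserted and mounted.
wait_for_slice() {
    target=$(slice_path "$@")
    while [ ! -e "$target" ]; do
        printf 'Insert the media holding %s, then press Enter: ' "$target"
        read -r reply || return 1      # give up if stdin is closed
    done
    printf 'Found %s\n' "$target"
}

# When run as a script with four arguments, act as the hook.
if [ "$#" -eq 4 ]; then
    wait_for_slice "$@"
fi
```

Compared with the inline echo, this only returns control to dar once the slice file really exists, which may be friendlier for someone swapping discs who is not familiar with the tool.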
Conclusions
Dar makes it extremely easy to just Do The Right Thing when making archives. One command makes a backup. It saves things in simple files. You can make an isolated catalog if you want, and it too is saved in a simple file. You can query what is in the files and where. You can restore from all or part of the files. You can simply play the backups forward, in order, to achieve a full and consistent restore. Or you can load data about them into dar_manager for an optimized restore.
A bit of scripting will be necessary to make incrementals, mainly to find the most recent backup or catalog to use as the reference. If backup files are named with care (for instance, by date), then this should be a pretty easy task.
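As a rough illustration of that scripting, here is a minimal sketch that names each backup by date and picks the newest isolated catalog as the reference. The function names and the YYYY-MM-DD naming scheme are my assumptions; the dar options are the ones used in the walkthrough, minus --verbose and --pause so that it can run unattended.

```sh
#!/bin/sh
# Sketch only: function names and the date-based naming are hypothetical.

# Print the basename (path minus ".1.dar") of the last isolated catalog in
# a directory, relying on date-based names sorting chronologically.
# Prints nothing if the directory holds no catalogs.
latest_catalog() {
    last=""
    for f in "$1"/*.1.dar; do
        [ -e "$f" ] || continue
        last=${f%.1.dar}
    done
    if [ -n "$last" ]; then
        printf '%s\n' "$last"
    fi
}

# Create a full backup if no catalog exists yet, else an incremental based
# on the most recent catalog. Uses $SOURCEDIR, $DRIVE and $CATDIR as in the
# walkthrough.
run_backup() {
    name=$(date +%Y-%m-%d)
    ref=$(latest_catalog "$CATDIR")
    set -- --create "$DRIVE/$name" --on-fly-isolate "$CATDIR/$name" \
        --slice 400M --min-digits 2 --fs-root "$SOURCEDIR"
    if [ -n "$ref" ]; then
        set -- "$@" --ref "$ref"
    fi
    dar "$@"
}
```

Running it twice on the same day would reuse the same name, so a real script would need to guard against that, for instance by adding the time to the name.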
I haven’t touched on resiliency yet. dar comes with tools for recovering archives that have had portions corrupted or lost. It can also rebuild the catalog if it is corrupted or lost. It adds “tape marks” (or “escape sequences”) to the archive along with the data stream. So every entry in the catalog is actually stored in the archive twice: once alongside the file data, and once at the end in the collected catalog. This allows dar to scan a corrupted file for the tape marks and reconstruct whatever is still intact, even if the catalog is lost. dar also integrates with tools like sha256sum and par2 to simplify archive integrity testing and restoration.
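For the integrity-testing part, one simple approach is to keep a checksum manifest next to the slices. The sketch below does this outside of dar with plain sha256sum, which is cruder than dar's own integration and gives detection only, not repair (that is what par2 is for). The function names and the .sha256 manifest name are hypothetical.

```sh
#!/bin/sh
# Sketch only: function names and the ".sha256" manifest name are hypothetical.

# Write a checksum manifest covering every slice of one archive.
# Runs inside the directory so the manifest holds relative file names
# and stays valid when the media is mounted elsewhere.
checksum_slices() {
    dir=$1
    base=$2
    ( cd "$dir" && sha256sum "$base".*.dar > "$base.sha256" )
}

# Re-read every slice and compare against the manifest.
# Returns non-zero if any slice is missing or altered.
verify_slices() {
    dir=$1
    base=$2
    ( cd "$dir" && sha256sum -c "$base.sha256" > /dev/null 2>&1 )
}
```

For example, checksum_slices "$DRIVE" bak1 after the backup completes, and verify_slices "$DRIVE" bak1 whenever the media is checked.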
This balances against the need to use a tool (dar, optionally with a GUI frontend) to restore files. I’ll discuss that more in the next post.

roneo says:

July 7, 2023 at 12:18 am

I’m using rsync a lot, and Grsync when I want a GUI

GNOME backintime is awesome for daily backups

John Goerzen says: @ changelog.complete.org

July 12, 2023 at 3:23 am

This is the fourth in a series about archiving to removable media (optical discs such as BD-Rs and DVD+Rs or portable hard drives). Here are the first three parts:

In part 1, I laid out my goals for the project, and considered a number of tools before determining dar and git-annex were my leading options.
In part 2, I took a deep dive into git-annex and simulated using it for this project.
In part 3, I did the same with dar.
And in this part, I want to put it together to come up with an initial direction to pursue.

I want to state at the outset that this is not a general review of dar or git-annex. This is an analysis of how those tools stack up to a particular use case. Neither tool focuses on this use case, and I note it is particularly far from the more common uses of git-annex. For instance, both tools offer support for cloud storage providers and special support for ssh targets, but neither of those are in-scope for this post.
Comparison Matrix
As part of this project, I made a comparison matrix which includes not just dar and git-annex, but also backuppc, bacula/bareos, and borg. This may give you some good context, and also some reference for other projects in this general space.
Reviewing the Goals
I identified some goals in part 1. They are all valid. As I have thought through the project more, I feel like I should condense them into a simpler ordered list, with the first being the most important. I omit some things here that both dar and git-annex can do (updates/incrementals, for instance; see the expanded goals list in part 1). Here they are:

The tool must not modify the source data in any way.
It must be simple to create or update an archive. Processes that require a lot of manual work, are flaky, or are difficult to do correctly, are unlikely to be done correctly and often. If it’s easy to do right, I’m more likely to do it. Put another way: an archive never created can never be restored.
The chances of a successful restore by someone that is not me, that doesn’t know Linux, and is at least 10 years in the future, should be maximized. This implies a simple toolset, solid support for dealing with media errors or missing media, etc.
Both a partial point-in-time restore and a full restore should be possible. The full restore must, at minimum, provide a consistent directory tree; that is, deletions, additions, and moves over time must be accurately reflected. Preserving modification times is a near-requirement, and preserving hard links, symbolic links, and other POSIX metadata is a significant nice-to-have.
There must be a strategy to provide redundancy; for instance, a way for one set of archive discs to be offsite, another onsite, and the two to be periodically swapped.
Use storage space efficiently.

Let’s take a look at how the two stack up against these goals.
Goal 1: Not modifying source data
With dar, this is accomplished. dar --create does not modify source data (and even has a mode to avoid updating atime), so that’s done.
git-annex normally does modify source data, in that it typically replaces files with symlinks into its hash-indexed storage directory. It can instead use hardlinks. In either case, you will wind up with files that have identical content (but may have originally been separate, non-linked files) linked together with git-annex. This would cause me trouble, as well as run the risk of modifying timestamps. So instead of just storing my data under a git-annex repo as is its most common case, I use the directory special remote with importtree=yes to sort of “import” the data in. This, plus my desire to have the repos sensible and usable on non-POSIX operating systems, accounts for a chunk of the git-annex complexity you see here. You wouldn’t normally see as much complexity with git-annex (though, as you will see, even without the directory special remote, dar still has less complexity).
Winner: dar, though I demonstrated a working approach with git-annex as well.
Goal 2: Simplicity of creating or updating an archive
Let us simply start by recognizing this:

Number of commands to create a first dar archive, including all splits: 1
Number of commands to create a first git-annex archive, with just the first two splits: 58
Number of commands to create a dar incremental: 1
Number of commands to update the last git-annex drive: 10
Number of commands to do a full restore of all slices and both archives with dar: 2 (1 if dar_manager used)
Number of commands to do a full restore of just the first two drives with git-annex: 9 (but my process may not be correct)

Both tools have a lot of power, but I must say, it is easier to wrap my head around what dar is doing than what git-annex is doing. Everything dar does is with files: here are the files to archive, here is an archive file, here is a detached (isolated) catalog. It is very straightforward. It took me far less time to develop my dar page than my git-annex page, despite having existing familiarity with both tools. As I pointed out in part 2, I still don’t fully understand how git-annex syncs metadata. Unsolved mysteries from that post include why the two git-annex drives had no idea what was on the other drives, and why the export operation silently did nothing. Additionally, for the optical disc case, I had to create a restricted-size filesystem/dataset for git-annex to write into in order to get the desired size limit.
Looking at the optical disc case, dar has a lot of nice infrastructure built in. With --pause and --execute, it can very easily be combined with disc burning operations. --slice will automatically limit the size of a given slice, regardless of how much disk space is free, meaning that the git-annex tricks of creating smaller filesystems/datasets are unnecessary with dar.
To create an initial full backup with dar, you just give it the size of the device, and it will automatically split up the archive, with hooks to integrate for burning or changing drives. About as easy as you could get.
With git-annex, you would run the commands to have it fill up the initial filesystem, then burn the disc (or remove the drive), then run the commands to create another repo on the second filesystem, and so forth.
With hard drives and git-annex, you would do something similar: let it fill up a repo on a drive, and if it exits with a space error, swap in the next. With dar, you would slice as with an optical disc. Dar’s slicing is less convenient in this case, though, as it assumes every drive is the same size, and yours may not be. You could work around that by using a slice size no bigger than the smallest drive, and putting multiple slices on larger drives if need be. If a single drive is large enough to hold your entire data set, though, you need not worry about this with either tool.
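The smallest-drive workaround comes down to a little arithmetic, sketched below. plan_slices is a hypothetical helper, and the capacities are assumed to be usable space in whole gigabytes; in practice one would leave some headroom for filesystem overhead.

```sh
#!/bin/sh
# Sketch only: plan_slices is a hypothetical helper, sizes are in whole GB.

# Given the usable capacity of each drive, use the smallest one as the slice
# size and report how many whole slices each drive can hold.
plan_slices() {
    min=$1
    for cap in "$@"; do
        if [ "$cap" -lt "$min" ]; then
            min=$cap
        fi
    done
    printf 'slice size: %sG\n' "$min"
    n=1
    for cap in "$@"; do
        printf 'drive %s (%sG): %s slice(s)\n' "$n" "$cap" "$((cap / min))"
        n=$((n + 1))
    done
}
```

For instance, plan_slices 2000 500 1200 picks a 500G slice and reports 4, 1, and 2 slices for the three drives. The third drive then leaves 200G unused, which is the cost of this approach.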
Here’s a warning about git-annex: it won’t store anything beneath directories named .git. My use case doesn’t have many of those. If your use case does, you’re going to have to figure out what to do about it. Maybe rename them to something else while the backup runs? In any case, it is simply a fact that git-annex cannot back up git repositories, and this cuts against being able to back up things correctly.
Another point is that git-annex has scalability concerns. If your archive set gets into the hundreds of thousands of files, you may need to split it into multiple distinct git-annex repositories. If this occurs — and it will in my case — it may serve to dull the shine of some of git-annex’s features such as location tracking.
A detour down the update strategies path
Update strategies get a little more complicated with both. First, let’s consider: what exactly should our update strategy be?
For optical discs, I might consider doing a monthly update. I could burn a disc (or more than one, if needed) regardless of how much data is going to go onto it, because I want no more than a month’s data lost in any case. An alternative might be to spool up data until I have a disc’s worth, and then write that, but that could possibly mean months between actually burning a disc. Probably not good.
For removable drives, we’re unlikely to use a new drive each month. So there it makes sense to continue writing to the drive until it’s full. Now we have a choice: do we write and preserve each month’s updates, or do we eliminate intermediate changes and just keep the most recent data?
With both tools, the monthly burn of an optical disc turns out to be very similar to the initial full backup to optical disc. The considerations for spanning multiple discs are the same. With both tools, we would presumably want to keep some metadata on the host so that we don’t have to refer to a previous disc to know what was burned. In the dar case, that would be an isolated catalog. For git-annex, it would be a metadata-only repo. I illustrated both of these in parts 2 and 3.
Now, for hard drives. Assuming we want to continue preserving each month’s updates, with dar, we could just write an incremental to the drive each month. Assuming that the size of the incremental is likely far smaller than the size of the drive, you could easily enough do this. More fancily, you could look at the free space on the drive and tell dar to use that as the size of the first slice. For git-annex, you simply avoid calling drop/dropunused. This will cause the old versions of files to accumulate in .git/annex. You can get at them with git annex commands. This may imply some degree of elevated risk, as you are modifying metadata in the repo each month, whereas with dar you could chmod a-w or even chattr +i the archive files once written. Hopefully this elevated risk is low.
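Sizing the first slice to the free space, and locking the archive afterwards, might look something like the sketch below. The function names and the safety margin are hypothetical. The --first-slice option and the k size suffix (1024 bytes by default) are from my recollection of the dar man page, so verify them for your version before relying on this.

```sh
#!/bin/sh
# Sketch only: function names and the safety margin are hypothetical.

# Free space on the filesystem holding a directory, in KiB (POSIX df output).
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Turn a free-space figure into a dar size argument, keeping a margin free.
# Prints e.g. "900k"; fails if the margin leaves no room at all.
first_slice_size() {
    avail=$1
    margin=$2
    if [ "$avail" -le "$margin" ]; then
        return 1
    fi
    printf '%sk\n' "$((avail - margin))"
}

# Drop write permission on all slices of an archive once it is complete.
seal_archive() {
    chmod a-w "$1/$2".*.dar
}

# Possible use (not run here; bak3 and the 10240 KiB margin are examples):
#   size=$(first_slice_size "$(free_kib "$DRIVE")" 10240) &&
#   dar --create "$DRIVE/bak3" --ref "$CATDIR/bak2" --slice 400M \
#       --first-slice "$size" --min-digits 2 --fs-root "$SOURCEDIR" &&
#   seal_archive "$DRIVE" bak3
```

The margin is there because the free-space figure can change between the df call and the write, and because a completely full filesystem is unpleasant to work with.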
If you don’t want to preserve each month’s updates, with dar, you could just write an incremental each month that is based on the previous drive’s last backup, overwriting the previous. That implies some risk of drive failure during the time the overwrite is happening. Alternatively, you could write an incremental and then use dar to merge it into the previous incremental, creating a new one. This implies some degree of extra space needed (maybe on a different filesystem) while doing this. With git-annex, you would use drop/dropunused as I demonstrated in part 2.
The winner for goal 2 is dar. The gap is biggest with optical discs and more narrow with hard drives, thanks to git-annex’s different options for updates. Still, I would be more confident I got it right with dar.
Goal 3: Greatest chance of successful restore in the distant future
If you use git-annex like I suggested in part 2, you will have a set of discs or drives that contain a folder structure with plain files in them. These files can be opened without any additional tools at all. For sheer ability to get at raw data, git-annex has the edge.
When you talk about getting a consistent full restore — without multiple copies of renamed files or deleted files coming back — then you are going to need to use git-annex to do that.
Both git-annex and dar provide binaries. Dar provides a win64 version on its Sourceforge page. On the author’s releases site, you can find the win64 version in addition to a statically-linked x86_64 version for Linux. The git-annex install page mostly directs you to package managers for your distribution, but the downloads page also lists builds for Linux, Windows, and Mac OS X. The Linux version is dynamic, but ships most of its .so files alongside. The Windows version requires cygwin.dll, and all versions require you to also install git itself. Both tools are in package managers for Mac OS X, Debian, FreeBSD, and so forth. Let’s just say that you are likely to be able to run either one on a future Windows or Linux system.
There are also GUI frontends for dar, such as DARGUI and gdar. This can increase the chances of a future person being able to use the software easily. git-annex has the assistant, which is based on a different use case and probably not directly helpful here.
When it comes to doing the actual restore process using software, dar provides the easier process here.
For dealing with media errors and the like, dar can integrate with par2. While technically you could use par2 against the files git-annex writes, that’s more cumbersome to manage to the point that it is likely not to be done. Both tools can deal reasonably with missing media entirely.
I’m going to give the edge on this one to git-annex; while dar does provide the easier restore and superior tools for recovering from media errors, the ability to access raw data as plain files without any tools at all is quite compelling. I believe it is the most critical advantage git-annex has, and it’s a big one.
Goal 4: Support high-fidelity partial and full restores
Both tools make it possible to do a full restore reflecting deletions, additions, and so forth. Dar, as noted, is easier for this, but it is possible with git-annex. So, both can achieve a consistent restore.
Part of this goal deals with fidelity of the restore: preserving timestamps, hard and symbolic links, ownership, permissions, etc. Of these, timestamps are the most important for me.
git-annex can’t do any of that. dar does all of it.
Some of this can be worked around using mtree as I documented in part 2. However, that implies a need to also provide mtree on the discs for future users, and I’m not sure mtree really exists for Windows. It also cuts against the argument that git-annex discs can be used without any tools. It is true, they can, but all you will get is filename and content; no accurate date. Timestamps are often highly relevant for everything from photos to finding an elusive document or record.
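To make the timestamp gap concrete, here is a minimal sketch of the general idea behind such a workaround: record each file's modification time in a side manifest, then reapply it after the plain files are copied back. This is not mtree and not anything git-annex provides; the function names and the JSON manifest format are hypothetical, chosen only for illustration, and it handles only mtimes (no ownership, permissions, or links).

```python
import json
import os
from pathlib import Path


def save_mtimes(root: Path, manifest: Path) -> None:
    """Record each file's modification time, keyed by path relative to root."""
    times = {
        p.relative_to(root).as_posix(): p.stat().st_mtime_ns
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    manifest.write_text(json.dumps(times, indent=2))


def restore_mtimes(root: Path, manifest: Path) -> None:
    """Reapply recorded modification times to files that still exist."""
    times = json.loads(manifest.read_text())
    for rel, mtime_ns in times.items():
        target = root / rel
        if target.is_file():
            # ns= takes (atime_ns, mtime_ns); the recorded mtime is used for both.
            os.utime(target, ns=(mtime_ns, mtime_ns))
```

Even in this toy form, the future user needs both the manifest and something able to interpret it, which is exactly the extra dependency described above.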
Winner: dar.
Goal 5: Supporting backup strategies with redundancy
My main goal here is to have two separate backup sets: one that is offsite, and one that is onsite. Depending on the strategy and media, they might just always stay that way, or periodically rotate. For instance, with optical discs, you might just burn two copies of every disc and store one at each place. For hard drives, since you will be updating the content of them, you might swap them periodically.
This is possible with both tools. With either one, if using the optical disc scheme I laid out, you can just burn two identical copies of each disc.
With the hard drive case, with dar, you can keep two directories of isolated catalogs, one for each drive set. A little identifier file on each drive will let you know which set to use.
git-annex can track locations itself. As I demonstrated in part 2, you can make each drive its own repo and add all drives from a given drive set to a git-annex group. When initializing a drive, you tell git-annex what group it’s a part of. From then on, git-annex knows what content is in each group and will add whatever a given drive’s group needs to that drive.
It’s possible to do this with both, but the winner here is git-annex.
Goal 6: Efficient use of storage
Here are situations in which one or the other will be more efficient:

Lots of small files: dar, due to reduced filesystem overhead
Compressible data: dar (git-annex doesn’t support compression)
Renamed files: git-annex (it will detect the sha256 match and avoid storing a duplicate copy)
Identical files: git-annex, unless they are hardlinked already (again, detects the sha256 match)
Small modifications to files (eg, ID3 tags on MP3s, EXIF data on photos, etc): dar (it supports rsync-style binary deltas)

The winner depends on your particular situation.
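The renamed-file and identical-file cases both come down to keying content by its hash rather than by its name. Here is a minimal sketch of that idea. It is not how git-annex is implemented; the function names are hypothetical and it only illustrates why a sha256 match means nothing new has to be stored.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def files_needing_storage(root: Path, already_stored: set[str]) -> list[Path]:
    """Return files under root whose content hash has not been seen yet.

    already_stored holds hex digests of content that is already archived.
    A renamed file or an identical copy has a known digest and is skipped.
    """
    seen = set(already_stored)
    new_files = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            digest = sha256_of(p)
            if digest not in seen:
                seen.add(digest)
                new_files.append(p)
    return new_files
```

With two files containing the same bytes, only the first is reported as needing storage, and a file whose digest is already in `already_stored` is skipped no matter what it is now called.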
Other notes
While not part of the goals above, dar is capable of using tapes directly. Tapes are not as common as other media, but they are often used in communities of people that archive lots of data.
Conclusions
Overall, dar is the winner for me. It is simpler in most areas, easier to get correct, and scales very well.
git-annex does, however, have some quite compelling points. Being able to access files as plain files is huge, and its location tracking is nicer than dar’s, even when using dar_manager.
Both tools are excellent and I recommend them both – and for more than the particular scenario shown here. Both have fantastic and responsive authors.
