I have several TB worth of family photos, videos, and other data. This needs to be backed up — and archived.
Backups and archives are often thought of as similar. And indeed, they may be done with the same tools at the same time. But the goals differ somewhat:
Backups are designed to recover from a disaster that you can fairly rapidly detect.
Archives are designed to survive for many years, protecting against disaster not only impacting the original equipment but also the original person that created them.
Reflecting on this, I realize that while a nice ZFS snapshot-based scheme that supports twice-hourly backups may be fantastic for that purpose, it becomes much less appealing for archives once you consider things like family members needing to access the data if you are incapacitated, or accessibility a few decades from now. ZFS doesn’t have the wide software support that NTFS, FAT, UDF, ISO-9660, etc. do.
This post isn’t about the pros and cons of the different storage media, nor is it about the pros and cons of cloud storage for archiving; those conversations can readily be found elsewhere. Let’s assume, for the sake of conversation, that we are considering BD-R optical discs as well as external HDDs, both of which are too small to hold the entire backup set.
What would you use for archiving in these circumstances?
Establishing goals
The goals I have are:
- Archives can be restored using Linux or Windows (even though I don’t use Windows, this requirement will ensure the broadest compatibility in the future)
- The archival system must be able to accommodate periodic updates consisting of new files, deleted files, moved files, and modified files, without requiring a rewrite of the entire archive dataset
- Archives can ideally be mounted on any common OS and the component files directly copied off
- Redundancy must be possible. In the worst case, one could manually copy one drive/disc to another. Ideally, the archiving system would automatically track making n copies of data.
- While a full restore may be a goal, simply finding one file or one directory may also be a goal. Ideally, an archiving system would be able to quickly tell me which discs/drives contain a given file.
- Ideally, preserves as much POSIX metadata as possible (hard links, symlinks, modification date, permissions, etc). However, for the archiving case, this is less important than for the backup case, with the possible exception of modification date.
- Must be easy enough to do, and sufficiently automatable, to allow frequent updates without error-prone or time-consuming manual hassle
I would welcome your ideas for what to use. Below, I’ll highlight different approaches I’ve looked into and how they stack up.
Basic copies of directories
The initial approach might be simply copying directories across. This works well if the data set to be archived is smaller than the archival media: you could just burn or rsync a new copy with every update and be done. Unfortunately, that approach breaks down with data of the size I’m dealing with, since a single rsync target can no longer hold everything. With some datasets, you could manually arrange rsync runs that store individual directories on individual devices, but that gets unwieldy fast and isn’t scalable.
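For a dataset that does fit, the whole scheme can be as simple as a couple of rsync invocations. A minimal sketch, with hypothetical paths and one drive per top-level directory:

```
# Photos to one drive, videos to another. --archive preserves permissions and
# timestamps; --delete propagates removals from the source to the copy.
rsync --archive --delete --human-readable --progress \
    /data/photos/ /mnt/archive-drive-1/photos/

rsync --archive --delete --human-readable --progress \
    /data/videos/ /mnt/archive-drive-2/videos/
```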
You could use something like my datapacker program to split the data across multiple discs/drives efficiently. However, updates will be a problem; you’d have to re-burn the entire set to get a consistent copy, or rely on external tools like mtree to reflect deletions. Not very convenient in any case.
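If you did go this route, you would at least want a checksummed manifest per device so deletions and modifications could be detected later. A rough sketch using only standard tools (paths hypothetical; this is a generic workaround, not a datapacker feature):

```
# Record a manifest alongside each burned/copied set; comparing two manifests
# later reveals files that were deleted, added, or changed between runs.
cd /data/photos
find . -type f -print0 | sort -z | xargs -0 sha256sum \
    > /staging/photos-$(date +%Y%m%d).manifest
```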
So I won’t be using this.
tar or zip
While you can split tar and zip files across multiple media, they have a lot of issues. GNU tar’s incremental mode is clunky and buggy; zip is even worse. tar files can’t be read randomly, making it extremely time-consuming to extract just certain files out of a tar file.
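For reference, GNU tar’s incremental mode works by recording file state in a snapshot file; a later run against the same snapshot file stores only changes. A minimal sketch with illustrative paths:

```
# Level-0 run: records current state in photos.snar and archives everything.
tar --create --file=photos-full.tar --listed-incremental=photos.snar /data/photos

# Later run: stores only files changed since the snapshot file was last updated.
tar --create --file=photos-incr1.tar --listed-incremental=photos.snar /data/photos
```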
The only thing going for these formats (and especially zip) is the wide compatibility for restoration.
dar
Here we start to get into the more interesting tools. Dar is, in my opinion, one of the best Linux tools that few people know about. Since I first wrote about dar in 2008, it’s added some interesting new features, among them binary deltas and cloud storage support. Dar has quite a few interesting features that I already use in other ways, and several could be quite helpful here:
- Dar can both read and write files sequentially (streaming, like tar), or with random-access (quick seek to extract a subset without having to read the entire archive)
- Dar can apply compression to individual files, rather than to the archive as a whole, facilitating both random access and resilience (corruption in one file doesn’t invalidate all subsequent files). Dar also supports numerous compression algorithms including gzip, bzip2, xz, lzo, etc., and can omit compressing already-compressed files.
- The end of each dar file contains a central directory (dar calls this a catalog). The catalog contains everything necessary to extract individual files from the archive quickly, as well as everything necessary to make a future incremental archive based on this one. Additionally, dar can make and work with “isolated catalogs” — a file containing the catalog only, without data.
- Dar can split the archive into multiple pieces called slices. This is best done with fixed-size slices (the --slice and --first-slice options), which lets the catalog record the slice number and preserves random-access capabilities. With the --execute option, dar can easily wait for a given slice to be burned, etc. (a sketch follows this list).
- Dar normally stores an entire new copy of a modified file, but can optionally store an rdiff binary delta instead. This has the potential to be far smaller (think of a case of modifying metadata for a photo, for instance).
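As a concrete, purely illustrative sketch of the slicing behavior mentioned above, assuming a staging area and 24G slices sized for BD-R media:

```
# Archive /data into 24G slices with per-file gzip compression; -p pauses
# before starting each new slice so the previous one can be burned or copied
# off. Basename, paths, and sizes are illustrative.
dar -c /staging/familyarchive-2024 -R /data -s 24G -z -p

# Or run a command after each slice completes (--execute); %p, %b, and %n
# expand to the slice's path, basename, and number.
dar -c /staging/familyarchive-2024 -R /data -s 24G -z \
    -E "echo 'finished slice %n of %b in %p'"
```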
Additionally, dar comes with a dar_manager program. dar_manager makes a database out of dar catalogs (or archives). This can then be used to identify the precise archive containing a particular version of a particular file.
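A hedged sketch of how dar_manager fits in, with made-up archive basenames:

```
# Build a catalog database from existing archives (or isolated catalogs),
# then ask which archive holds which version of a given file.
dar_manager -C archives.dmd
dar_manager -B archives.dmd -A /staging/familyarchive-2024
dar_manager -B archives.dmd -A /staging/familyarchive-2024-incr1
dar_manager -B archives.dmd -f photos/2019/img_1234.jpg
```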
All this combines to make a useful system for archiving. Isolated catalogs are tiny, and it would be easy enough to include the isolated catalogs for the entire set of archives that came before (or even the dar_manager database file) with each new incremental archive. This would make restoration of a particular subset easy.
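For instance, an update run might look roughly like this (basenames are hypothetical; the previous run’s isolated catalog serves as the reference, and a new tiny catalog is produced alongside the new archive):

```
# Isolate a catalog from an existing archive (tiny, contains no file data):
dar -C /staging/familyarchive-2024-cat -A /staging/familyarchive-2024

# Differential run using that catalog as the reference, writing an on-fly
# isolated catalog (-@) for the new archive at the same time:
dar -c /staging/familyarchive-2024-incr1 -R /data -s 24G -z \
    -A /staging/familyarchive-2024-cat -@ /staging/familyarchive-2024-incr1-cat
```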
The main thing to address with dar is that you do need dar itself to extract the archive. Every dar release comes with source code and a win64 build, and dar also supports building a statically-linked Linux binary. It would therefore be easy to include the win64 binary, a static Linux binary, and the source with every archive run. dar is also part of multiple Linux and BSD distributions, which are archived around the Internet. I think this provides reasonable future-proofing to make sure dar archives will still be readable in the future.
The other challenge is user ability. While dar is highly portable, it is fundamentally a CLI tool and will require CLI skills on the part of whoever restores from it. I suspect, though, that I could write up a few pages of instructions to include with the archives and make that a reasonably easy process. Not everyone can use a CLI, but I would expect that a person who could follow those instructions could be found readily enough.
One other benefit of dar is that it could easily be used with tapes. The LTO series is liked by various hobbyists, though it could pose formidable obstacles to non-hobbyists trying to access the data in future decades. Additionally, since the archive is a big file, it lends itself to working with par2 to provide redundancy against a certain amount of data corruption.
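A rough par2 sketch, assuming dar’s usual basename.N.dar slice naming and about 10% redundancy (filenames illustrative):

```
# Generate ~10% recovery data for a slice before burning it.
par2 create -r10 familyarchive-2024.1.dar.par2 familyarchive-2024.1.dar

# Later, check the slice and repair it if portions have become unreadable.
par2 verify familyarchive-2024.1.dar.par2
par2 repair familyarchive-2024.1.dar.par2
```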
git-annex
git-annex is an interesting program that is designed to facilitate managing large sets of data and moving it between repositories. git-annex has particular support for offline archive drives and tracks which drives contain which files.
The idea would be to store the data to be archived in a git-annex repository. Then git-annex commands could generate filesystem trees on the external drives (or trees to be burned to read-only media).
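A minimal sketch of that workflow, with hypothetical paths and repository names:

```
# Main repository holding the data to be archived.
cd /data/archive-repo
git init
git annex init "main archive repo"
git annex add .
git commit -m "initial import"

# Clone onto an external drive and pull the file contents onto it.
git clone /data/archive-repo /mnt/drive1/archive-repo
cd /mnt/drive1/archive-repo
git annex init "archive drive 1"
git annex get .
```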
In a post about using git-annex for blu-ray backups, an earlier thread about DVD-Rs was mentioned.
This has a few interesting properties. For one, with due care, the files can be stored on archival media as regular files. There are different options for how to generate the archives; some of them would place the entire git-annex metadata on each drive/disc. With that arrangement, one could access the individual files without git-annex. With git-annex, one could reconstruct the final (or any intermediate) state of the archive appropriately, handling deletions, renames, etc. You would also easily be able to know where copies of your files are.
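For example, tracking copies and requiring a minimum number of them is built in (the file path here is illustrative):

```
# Which repositories (drives) currently hold this file?
git annex whereis photos/2019/img_1234.jpg

# Refuse to drop content anywhere unless at least two copies exist.
git annex numcopies 2
```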
In practice, this is somewhat more challenging. Hundreds of thousands of files (what I would consider a medium-sized archive) can pose some challenges, running into hours-long execution times when used in conjunction with the directory special remote (but only minutes with a standard git-annex repo).
Ruling out the directory special remote, I had thought I could maybe just work with my files in git-annex directly. However, I ran into some challenges with that approach as well. I am uncomfortable with git-annex mucking about with hard links in my source data. While it does try to preserve timestamps in the source data, these are lost on the clones. I wrote up my best effort to work around all this.
In a forum post, the author of git-annex comments that “I don’t think that CDs/DVDs are a particularly good fit for git-annex, but it seems a couple of users have gotten something working.” The page he references is Managing a large number of files archived on many pieces of read-only medium. Some of that discussion is a bit dated (for instance, the directory special remote has the importtree feature that implements what was being asked for there), but has some interesting tips.
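A hedged sketch of the directory special remote with exporttree/importtree, assuming a drive mounted at /mnt/drive2 and a branch named main:

```
# Plain files are written to the drive via exporttree; importtree allows
# changes made directly on the drive to be imported back.
git annex initremote drive2 type=directory directory=/mnt/drive2/archive \
    encryption=none exporttree=yes importtree=yes
git annex export main --to drive2
```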
git-annex supplies win64 binaries, and git-annex is included with many distributions as well. So it should be nearly as accessible as dar in the future. Since git-annex would be required to restore a consistent recovery image, similar caveats as with dar apply; CLI experience would be needed, along with some written instructions.
Bacula and BareOS
Although primarily tape-based archivers, these do also nominally support drives and optical media. However, they are much more tailored as backup tools, especially with the ability to pull from multiple machines. They require a database and extensive configuration, making them a poor fit for both the creation and future extractability of this project.
Conclusions
I’m going to spend some more time with dar and git-annex, testing them out, and hope to write some future posts about my experiences.
@jgoerzen I’ve recently been through this, cleaning out old data and building a new archive. I went with git-annex, and now I’m trying to figure out how to get my data out of it. cp -L seems to be my saving grace to follow the symbolic links that are everywhere. Some things I’m doing for my next iteration:
- Compartmentalizing to a Toughbook dedicated to organizing my archive
- Using M-DISCs for cold data instead of Blu-ray (due to rapid bit rot)
- Using NVMe-to-USB external drives for hot/warm data
- … TBD: hashes and par2 or some sort of recovery mechanism in case of bit rot.
@mj git-annex’s unlocked mode (combined with thin) may help you there. There’s a lot of conversation on r/DataHoarders about M-DISC and whether or not they’re still worth it in the BD-R era, since it seems like BD-Rs already use a chemistry that is much more like what M-DISC was doing for DVDs. IOW, they may not add much now. Still, I may try them out, at least for one copy. I think I may burn at least 2 copies of each disc, using different brands of media.
@jgoerzen Interesting, I’ll have to check it out; I’ve been trying to figure out my next archive kit, thanks. I expect they’ll only live about a decade or so in my archive before I replace them with something new; the biggest thing I learned after building my last backup station was the value of portability and frequent upgrades. Thanks for the git-annex tip, I was sure there was something easy I was missing.
Hello,
I simply like restic…
@jgoerzen If using optical media for long-term archives (>10 years), you should acquire “M-Disc” BD-R media to use. They don’t fade like dye-based media. I use them for my annual archives. As for the data, I stage archives on a scratch drive and then burn them to disc with K3b or Brasero. As you point out, archives are a whole different mindset from regular backups. And thanks for the info about Dar. I should look into that!
We use Restic to back up Windows fileservers.
I don’t think that Restic will work well with optical discs that are smaller than the backup set.
@jgoerzen one piece of advice: don’t use git-annex’s encryption systems. it’s clunky, confusing, and really error-prone. it’s nice to have encryption at rest, but if i would really need this in a next project, i’d use something like tomb AKA just LUKS… or a borg backend.
@jgoerzen also, in my experience the main problem with git-annex (on top of usability, which can be nightmarish if you don’t know git well enough) is that it can’t handle a large number of files reasonably enough. things become slow quick, so i have multiple git-annex repositories and syncing those is a problem on its own.
I’m using rsync a lot, and Grsync when I want a GUI
GNOME backintime is awesome for daily backups