Using dar for Data Archiving

This is the third post in a series about data archiving to removable media (optical discs and hard drives). In the first, I explained the difference between backing up and archiving, established goals for the project, and said I’d evaluate git-annex and dar. The second post evaluated git-annex, and now it’s time to look at dar. The series will conclude with a post comparing git-annex with dar.

What is dar?

I could open with the same thing I did with git-annex, just changing the name of the program: “[dar] is a fantastic and versatile program that does… well, it’s one of those things that can do so much that it’s a bit hard to describe.” It is, fundamentally, an archiver like tar or zip (makes one file representing a bunch of other files), but it goes far beyond that. dar’s homepage lays out a comprehensive list of features, which I will try to summarize here.

Dar itself is both a library (with C++ and Python bindings) for interacting with data, and a CLI tool (dar itself).
Alongside this, there is an ecosystem of tools around dar, including GUIs for multiple platforms, backup scripts, and FUSE implementations.
Dar is like tar in that it can read and write files sequentially if desired. Dar archives can be streamed, just like tar archives. But dar takes it further; if you have dar_slave on the remote end, random access is possible over ssh (dramatically speeding up certain operations).
Dar is like zip in that a dar archive contains a central directory (called a catalog) which permits random access to the contents of an archive. In other words, you don’t have to read an entire archive to extract just one file (assuming the archive is on disk or something that itself permits random access). Also, dar can compress each file individually, rather than the tar approach of compressing the archive as a whole. This increases archive performance (dar knows not to try to compress already-compressed data), boosts restore resilience (corruption of one part of an archive doesn’t invalidate the entire rest of it), and boosts restore performance (permitting random access).
Dar can split an archive into multiple pieces called slices, and it can even split member files among the slices. The catalog contains information allowing you to know which slice(s) a given file is saved in.
The catalog can also be saved off in a file of its own (dar calls this an “isolated catalog”). Isolated catalogs record just metadata about files archived.
dar_manager can assemble a database by reading archives or isolated catalogs, letting you know where files are stored and facilitating restores using the minimal number of discs.
Dar supports differential/incremental backups, which record changes since the last backup. These backups record not just additions, but also deletions. dar can optionally use rsync-style binary deltas to minimize the space needed to record changes. Dar does not suffer from GNU tar’s data loss bug with incrementals.
Dar can “slice and dice” archives like Perl does strings. The usage notes page shows how you can merge archives, create decremental archives (where the full backup always reflects the current state of the system, and incrementals go backwards in time instead of forwards), etc. You can change the compression algorithm on an existing archive, re-slice it, etc.
Dar is extremely careful about preserving all metadata: hard links, sparse files, symlinks, timestamps (including subsecond resolution), EAs, POSIX ACLs, resource forks on Mac, detecting files being modified while being read, etc. It makes a nice way to copy directories, sort of similar to rsync -avxHAXS.

So to tie this together for this project, I will set up a 400MB slice size (to mimic what I did with git-annex), and see how dar saves the data and restores it.

Isolated cataloges aren’t strictly necessary for this, but by using them (and/or dar_manager), we can build up a database of files and locations and thus directly compare dar to git-annex location tracking.

Walkthrough: Creating the first archive

As with the git-annex walkthrough, I’ll set some variables to make it easy to remember:

$SOURCEDIR is the directory being backed up
$DRIVE is the directory for backups to be stored in. Since dar can split by a specified size, I don’t need to make separate filesystems to simulate the separate drive experience as I did with git-annex.
$CATDIR will hold isolated catalogs
$DARDB points to the dar_manager database

OK, we can run the backup immediately. No special setup is needed. dar supports both short-form (single-character) parameters and long-form ones. Since the parameters probably aren’t familiar to everyone, I will use the long-form ones in these examples.

Here’s how we create our initial full backup. I’ll explain the parameters below:

$ dar \ --verbose \ --create $DRIVE/bak1 \ --on-fly-isolate $CATDIR/bak1 \ --slice 400M \ --min-digits 2 \ --pause \ --fs-root $SOURCEDIR

Let’s look at each of these parameters:

–verbose does what you expect
–create selects the operation mode (like tar -c) and gives the archive basename
–on-fly-isolate says to write an isolated catalog as well, right while making the archive. You can always create an isolated catalog later (which is fast, since it only needs to read the last bits of the last slice) but it’s more convenient to do it now, so we do. We give the base name for the isolated catalog also.
–slice 400M says to split the archive, and create slices 400MB each.
–min-digits 2 pertains to naming files. Without it, dar would create files named bak1.dar.1, bak1.dar.2, bak1.dar.10, etc. dar works fine with this, but it can be annoying in ls. This is just convenience for humans.
–pause tells dar to pause after writing each slice. This would let us swap drives, burn discs, etc. I do this for demonstration purposes only; it isn’t strictly necessary in this situation. For a more powerful option, dar also supports –execute, which can run commands after each slice.
–fs-root gives the path to actually back up.

This same command could have been written with short options as:

$ dar -v -c $DRIVE/bak1 -@ $CATDIR/bak1 -s 400M -9 2 -p -R $SOURCEDIR

What does it look like while running? Here’s an excerpt:

... Adding file to archive: /acrypt/no-backup/jgoerzen/testdata/[redacted] Finished writing to file 1, ready to continue ? [return = YES | Esc = NO] ... Writing down archive contents... Closing the escape layer... Writing down the first archive terminator... Writing down archive trailer... Writing down the second archive terminator... Closing archive low layer... Archive is closed.

-------------------------------------------- 581 inode(s) saved including 0 hard link(s) treated 0 inode(s) changed at the moment of the backup and could not be saved properly 0 byte(s) have been wasted in the archive to resave changing files 0 inode(s) with only metadata changed 0 inode(s) not saved (no inode/file change) 0 inode(s) failed to be saved (filesystem error) 0 inode(s) ignored (excluded by filters) 0 inode(s) recorded as deleted from reference backup -------------------------------------------- Total number of inode(s) considered: 581 -------------------------------------------- EA saved for 0 inode(s) FSA saved for 581 inode(s) -------------------------------------------- Making room in memory (releasing memory used by archive of reference)... Now performing on-fly isolation... ...

That was easy! Let’s look at the contents of the backup directory:

$ ls -lh $DRIVE total 3.7G -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.01.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.02.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.03.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:27 bak1.04.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.05.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.06.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.07.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:28 bak1.08.dar -rw-r--r-- 1 jgoerzen jgoerzen 400M Jun 16 19:29 bak1.09.dar -rw-r--r-- 1 jgoerzen jgoerzen 156M Jun 16 19:33 bak1.10.dar

And the isolated catalog:

$ ls -lh $CATDIR total 37K -rw-r--r-- 1 jgoerzen jgoerzen 35K Jun 16 19:33 bak1.1.dar

The isolated catalog is stored compressed automatically.

Well this was easy. With one command, we archived the entire data set, split into 400MB chunks, and wrote out the catalog data.

Walkthrough: Inspecting the saved archive

Can dar tell us which slice contains a given file? Sure:

$ dar --list $DRIVE/bak1 --list-format=slicing | less Slice(s)|[Data ][D][ EA ][FSA][Compr][S]|Permission| Filemane --------+--------------------------------+----------+----------------------------- ... 1 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- [redacted] 1-2 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- [redacted] 2 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- [redacted] ...

This illustrates the transition from slice 1 to slice 2. The first file was stored entirely in slice 1; the second stored partially in slice 1 and partially in slice 2, and third solely in slice 2. We can get other kinds of information as well.

$ dar --list $DRIVE/bak1 | less [Data ][D][ EA ][FSA][Compr][S]| Permission | User | Group | Size | Date | filename --------------------------------+------------+-------+-------+---------+-------------------------------+------------ [Saved][ ] [-L-][ 0%][X] -rwxr--r-- jgoerzen jgoerzen 24 Mio Mon Mar 5 07:58:09 2018 [redacted] [Saved][ ] [-L-][ 0%][X] -rwxr--r-- jgoerzen jgoerzen 16 Mio Mon Mar 5 07:58:09 2018 [redacted] [Saved][ ] [-L-][ 0%][X] -rwxr--r-- jgoerzen jgoerzen 22 Mio Mon Mar 5 07:58:09 2018 [redacted]

These are the same files I was looking at before. Here we see they are 24MB, 16MB, and 22MB in size, and some additional metadata. Even more is available in the XML list format.

Walkthrough: updates

As with git-annex, I’ve made some changes in the source directory: moved a file, added another, and deleted one. Let’s create an incremental backup now:

$ dar \ --verbose \ --create $DRIVE/bak2 \ --on-fly-isolate $CATDIR/bak2 \ --ref $CATDIR/bak1 \ --slice 400M \ --min-digits 2 \ --pause \ --fs-root $SOURCEDIR

This command is very similar to the earlier one. Instead of writing an archive and catalog named bak1, we write one named bak2. What’s new here is --ref $CATDIR/bak1. That says, make an incremental based on an archive of reference. All that is needed from that archive of reference is the detached catalog. --ref $DRIVE/bak1 would have worked equally well here.

Here’s what I did to the $SOURCEDIR:

Renamed a file to file01-unchanged
Deleted a file
Copied /bin/cp to a file named cp

Let’s see if dar’s command output matches this:

... Adding file to archive: /acrypt/no-backup/jgoerzen/testdata/file01-unchanged Saving Filesystem Specific Attributes for /acrypt/no-backup/jgoerzen/testdata/file01-unchanged Adding file to archive: /acrypt/no-backup/jgoerzen/testdata/cp Saving Filesystem Specific Attributes for /acrypt/no-backup/jgoerzen/testdata/cp Adding folder to archive: [redacted] Saving Filesystem Specific Attributes for [redacted] Adding reference to files that have been destroyed since reference backup... ... -------------------------------------------- 3 inode(s) saved including 0 hard link(s) treated 0 inode(s) changed at the moment of the backup and could not be saved properly 0 byte(s) have been wasted in the archive to resave changing files 0 inode(s) with only metadata changed 578 inode(s) not saved (no inode/file change) 0 inode(s) failed to be saved (filesystem error) 0 inode(s) ignored (excluded by filters) 2 inode(s) recorded as deleted from reference backup -------------------------------------------- Total number of inode(s) considered: 583 -------------------------------------------- EA saved for 0 inode(s) FSA saved for 3 inode(s) -------------------------------------------- ...

Yes, it does. The rename is recorded as a deletion and an addition, since dar doesn’t directly track renames. So the rename plus the deletion account for the two deletions. The rename plus the addition of cp count as 2 of the 3 inodes saved; the third is the modified directory from which files were deleted and moved out.

Let’s see the files that were created:

$ ls -lh $DRIVE/bak2* -rw-r--r-- 1 jgoerzen jgoerzen 18M Jun 16 19:52 /acrypt/no-backup/jgoerzen/dar-testing/drive/bak2.01.dar $ ls -lh $CATDIR/bak2* -rw-r--r-- 1 jgoerzen jgoerzen 22K Jun 16 19:52 /acrypt/no-backup/jgoerzen/dar-testing/cat/bak2.1.dar

What does –list look like now?

Slice(s)|[Data ][D][ EA ][FSA][Compr][S]|Permission| Filemane --------+--------------------------------+----------+----------------------------- [ ][ ] [---][-----][X] -rwxr--r-- [redacted] 1 [Saved][ ] [-L-][ 0%][X] -rwxr--r-- file01-unchanged ... [--- REMOVED ENTRY ----][redacted] [--- REMOVED ENTRY ----][redacted]

Here I show an example of:

A file that was not changed from the initial backup. Its presence was simply noted, but because we’re doing an incremental, the data wasn’t saved.
A file that is saved in this incremental, on slice 1.
The two deleted files

Walkthrough: dar_manager

As we’ve seen above, the two archives (or their detached catalog) give us a complete picture of what files were present at the time of the creation of each archive, and what files were stored in a given archive. We can certainly continue working in that way. We can also use dar_manager to build a comprehensive database of these archives, to be able to find what media is necessary to restore each given file. Or, with dar_manager’s –when parameter, we can restore files as of a particular date.

Let’s try it out. First, we create our database:

$ dar_manager --create $DARDB $ dar_manager --base $DARDB --add $DRIVE/bak1 Auto detecting min-digits to be 2 $ dar_manager --base $DARDB --add $DRIVE/bak2 Auto detecting min-digits to be 2

Here we created the database, and added our two catalogs to it. (Again, we could have as easily used $CATDIR/bak1; either the archive or its isolated catalog will work here.) It’s important to add the catalogs in order.

Let’s do some quick experimentation with dar_manager:

$ dar_manager -v --base $DARDB --list Decompressing and loading database to memory...


dar path         :

dar options      :

database version : 6

compression used : gzip

compression level: 9
archive #   |    path      |    basename

------------+--------------+---------------

        1       /acrypt/no-backup/jgoerzen/dar-testing/drive    bak1

        2       /acrypt/no-backup/jgoerzen/dar-testing/drive    bak2

$ dar_manager --base $DARDB --stat archive # | most recent/total data | most recent/total EA --------------+-------------------------+----------------------- 1 580/581 0/0 2 3/3 0/0

The –list option shows the correlation between dar_manager archive number (1, 2) with filenames (bak1, bak2). It is coincidence here that 1/bak1 and 2/bak2 correlate; that’s not necessarily the case. Most dar_manager commands operate on archive number, while dar commands operate on archive path/basename.

Now let’s see just what files are saved in archive #2, the incremental:

$ dar_manager --base $DARDB --used 2 [ Saved ][ ] [redacted] [ Saved ][ ] file01-unchanged [ Saved ][ ] cp

Now we can also where a file is stored. Here’s one that was saved in the full backup and unmodified in the incremental:

$ dar_manager --base $DARDB --file [redacted] 1 Fri Jun 16 19:15:12 2023 saved absent 2 Fri Jun 16 19:15:12 2023 present absent

(The absent at the end refers to extended attributes that the file didn’t have)

Similarly, for files that were added or removed, they’ll be listed only at the appropriate place.

Walkthrough: Restoration

I’m not going to repeat the author’s full restoration with dar page, but here are some quick examples.

A simple way of doing everything is using incrementals for the whole series. To do that, you’d have bak1 be full, bak2 based on bak1, bak3 based on bak2, bak4 based on bak3, etc. To restore from such a series, you have two options:

Use dar to simply extract each archive in order. It will handle deletions, renames, etc. along the way.
Use dar_manager with the backup database to do manage the process. It may be somewhat more efficient, as it won’t bother to restore files that will later be modified or deleted.

If you get fancy — for instance, bak2 is based on bak1, bak3 on bak2, bak4 on bak1 — then you would want to use dar_manager to ensure a consistent restore is completed. Either way, the process is nearly identical. Also, I figure, to make things easy, you can save a copy of the entire set of isolated catalogs before you finalize each disc/drive. They’re so small, and this would let someone with just the most recent disc build a dar_manager database without having to go through all the other discs.

Anyhow, let’s do a restore using just dar. I’ll make a $RESTOREDIR and do it that way.

$ dar \ --verbose \ --extract $DRIVE/bak1 \ --fs-root $RESTOREDIR \ --no-warn \ --execute "echo Ready for slice %n. Press Enter; read foo"

This –execute lets us see how dar works; this is an illustration of the power it has (above –pause); it’s a snippet interpreted by /bin/sh with %n being one of the dar placeholders. If memory serves, it’s not strictly necessary, as dar will prompt you for slices it needs if they’re not mounted. Anyhow, you’ll see it first reading the last slice, which contains the catalog, then reading from the beginning.

Here we go:

Auto detecting min-digits to be 2 Opening archive bak1 ... Opening the archive using the multi-slice abstraction layer... Ready for slice 10. Press Enter ... Loading catalogue into memory... Locating archive contents... Reading archive contents... File ownership will not be restored du to the lack of privilege, you can disable this message by asking not to restore file ownership [return = YES | Esc = NO] Continuing... Restoring file's data: [redacted] Restoring file's FSA: [redacted] Ready for slice 1. Press Enter ... Ready for slice 2. Press Enter ... -------------------------------------------- 581 inode(s) restored including 0 hard link(s) 0 inode(s) not restored (not saved in archive) 0 inode(s) not restored (overwriting policy decision) 0 inode(s) ignored (excluded by filters) 0 inode(s) failed to restore (filesystem error) 0 inode(s) deleted -------------------------------------------- Total number of inode(s) considered: 581 -------------------------------------------- EA restored for 0 inode(s) FSA restored for 0 inode(s) --------------------------------------------

The warning is because I’m not doing the extraction as root, which limits dar’s ability to fully restore ownership data.

OK, now the incremental:

$ dar \ --verbose \ --extract $DRIVE/bak2 \ --fs-root $RESTOREDIR \ --no-warn \ --execute "echo Ready for slice %n. Press Enter; read foo" ... Ready for slice 1. Press Enter ... Restoring file's data: /acrypt/no-backup/jgoerzen/dar-testing/restore/file01-unchanged Restoring file's FSA: /acrypt/no-backup/jgoerzen/dar-testing/restore/file01-unchanged Restoring file's data: /acrypt/no-backup/jgoerzen/dar-testing/restore/cp Restoring file's FSA: /acrypt/no-backup/jgoerzen/dar-testing/restore/cp Restoring file's data: /acrypt/no-backup/jgoerzen/dar-testing/restore/[redacted directory] Removing file (reason is file recorded as removed in archive): [redacted file] Removing file (reason is file recorded as removed in archive): [redacted file]

This all looks right! Now how about we compare the restore to the original source directory?

$ diff -durN $SOURCEDIR $RESTOREDIR

No changes – perfect.

We could instead do this restore via a single dar_manager command, though annoyingly, we’d have to pass all top-level files/directories to dar_manager –restore. But still, it’s one command, and basically automates and optimizes the dar restores shown above.

Conclusions

Dar makes it extremely easy to just Do The Right Thing when making archives. One command makes a backup. It saves things in simple files. You can make an isolated catalog if you want, and it too is saved in a simple file. You can query what is in the files and where. You can restore from all or part of the files. You can simply play the backups forward, in order, to achieve a full and consistent restore. Or you can load data about them into dar_manager for an optimized restore.

A bit of scripting will be necessary to make incrementals; finding the most recent backup or catalog. If backup files are named with care — for instance, by date — then this should be a pretty easy task.

I haven’t touched on resiliency yet. dar comes with tools for recovering archives that have had portions corrupted or lost. It can also rebuild the catalog if it is corrupted or lost. It adds “tape marks” (or “escape sequences”) to the archive along with the data stream. So every entry in the catalog is actually stored in the archive twice: once alongside the file data, and once at the end in the collected catalog. This allows dar to scan a corrupted file for the tape marks and reconstruct whatever is still intact, even if the catalog is lost. dar also integrates with tools like sha256sum and par2 to simplify archive integrity testing and restoration.

This balances against the need to use a tool (dar, optionally with a GUI frontend) to restore files. I’ll discuss that more in the next post.

2 thoughts on “Using dar for Data Archiving”

Pingback: 使用Dar进行数据归档 - 偏执的码农

Mentions

💬 John Goerzen

Likes

👍 Simon Michael
👍 Jaakko Niemi
👍 clarkema@mastodon.social
👍 Anke
👍 Al
👍 Nivex 🐧 📻
👍 cathos

Reposts

♻️ HMF
♻️ Willard Goosey

The Changelog

Comments on family, technology, and society

Using dar for Data Archiving

2 thoughts on “Using dar for Data Archiving”

Mentions

Likes

Reposts

Leave a Reply Cancel reply