Monthly Archives: April 2008

Knuth and Reusable Code

In the recent interview with InformIT, Donald Knuth said:

I also must confess to a strong bias against the fashion for reusable code. To me, “re-editable code” is much, much better than an untouchable black box or toolkit. I could go on and on about this. If you’re totally convinced that reusable code is wonderful, I probably won’t be able to sway you anyway, but you’ll never convince me that reusable code isn’t mostly a menace.

I have tried in vain to locate any place where he talks about this topic at greater length. Does anyone have a link?

A Smart Gas Tax

The recent announcements by McCain and Clinton of their support for a temporary repeal of the Federal gas tax make me sick. More on why later, but first, I want to put forth my idea. I think both Republicans and Democrats would like it — as it’s based on market principles and achieves a reduction in costs to the average household, while simultaneously helping the environment and reducing our dependency on foreign oil. But of course, it’s courageous, and we don’t have many politicians of that type anymore.

What we need is a large, revenue-neutral, gas tax increase. Now, before people go nuts, let’s explore what this means.

Revenue-neutral means that it doesn’t result in a net increase of monies going to the government. The increase in the gas tax rate is offset by a decrease in the income tax, tied to the cost of the direct and indirect taxable gasoline each family or business consumes. So on day 1, if your cost of filling up the tank goes up by $10 in a week, and you are an average family, your total paychecks also go up by $10. Your cost for receiving a package might go up by $1, and your paycheck goes up by the same amount. So you’re no worse off than before — if you’re average.

Let’s look at the pros and cons of this sort of plan:

  • The economic incentive to be efficient consumers of gas is magnified. This will eventually lead to Americans having more money in their pockets, stronger market incentives for fuel efficiency, and a decreasing (or more slowly increasing) price of oil as demand slows.
  • Economic incentives to use mass transit, live close to urban centers, or drive fuel-efficient vehicles are magnified. Likewise, the economic incentives to invest in mass transit and efficient automobiles are also magnified.
  • As more efficient technologies come on the market, and Americans decide that they’d like to pad their bank accounts by hundreds or thousands of dollars a year, more sustainable and environmentally-friendly development patterns will emerge. Also, the price of oil will be kept low. Of course, people that choose not to change will, on average, be no worse off than before.
  • Alternative choices to the automobile will have a greater incentive to develop. Think the return of a fast national passenger and freight rail network, greater mass transit options, etc.
  • The marketplace will drive Detroit to love making fuel-efficient vehicles, because they will be the new profit centers.
  • This sort of thing is known to work well in other countries around the world.

If we think more long-term, we see even more positive effects:

  • The return of local agriculture and manufacturing. Because their transportation costs are lower, local farmers and manufacturers will be able to undercut Walmart’s prices, since Walmart’s much-vaunted national distribution network becomes relatively more expensive to run. Unless, that is, Walmart starts buying local — which is a good thing too. Either way, this is good for American jobs.
  • Keeping all that oil money in the domestic economy is a good thing for American jobs, too.
  • Our businesses will have a jump start on being competitive in the increasingly carbon-regulated global marketplace.

As for the cons:

  • Eventually this will lead to a net reduction in Federal revenues as efficiencies develop in the marketplace and people save money on gas. Corresponding budget cuts will be required. (A good thing, I figure)
  • Implementing this all at once would be a shock to people living inefficiently now — those far above the average. It would have to be phased in gradually to avoid jolting the economy.

Now, for the McCain/Clinton plan: it’s a farce. Reducing the gas tax means cheaper gas, which means more consumption of gas, which in turn leads to — yes — higher gas prices. Its real effect will be minimal, and it is terrible long-term policy. It charges tens of billions of dollars to the national credit card (which we, and our children, will have to repay) while achieving almost no benefit now. It’s a gimmick through and through, and it says loud and clear that neither candidate is on track for the “Straight Talk Express”.

Update 4/29/2008: One potential solution to the problem of declining revenues over time is to periodically re-index the averages to mirror current usage. Assuming this really does lead to the expected drop in consumption, there is no sense, in 2020, in paying people for the gas they would have used in 2008.

datapacker

Every so often, I come across some utility that I need. I think it must have been written before, but I can’t find it.

Today I needed a tool to take a set of files and split them up into directories, each of a size that will fit on a DVD. I wanted a tool that could either produce the minimum number of DVDs or keep the files in order. I couldn’t find one. So I wrote datapacker.

datapacker is a tool to group files by size. It is perhaps most often used to fit a set of files onto the minimum number of CDs or DVDs.

datapacker is designed to group files such that they fill fixed-size containers (called “bins”) using the minimum number of containers. This is useful, for instance, if you want to archive a number of files to CD or DVD, and want to organize them such that you use the minimum possible number of CDs or DVDs.
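
Under the hood, this is classic bin packing. Just to illustrate the concept (this is a from-scratch sketch, not datapacker’s actual algorithm or code), a naive first-fit packer in Haskell looks something like this:

-- Naive first-fit sketch (illustration only, not datapacker's code):
-- each bin has a fixed capacity, and a file goes into the first bin
-- with room for it, or into a new bin if none has space.
type Size = Integer

firstFit :: Size -> [(FilePath, Size)] -> [[(FilePath, Size)]]
firstFit capacity = foldl place []
  where
    place bins file@(_, sz) = insertFile bins
      where
        insertFile [] = [[file]]
        insertFile (b:bs)
          | binSize b + sz <= capacity = (file : b) : bs
          | otherwise                  = b : insertFile bs
    binSize = sum . map snd

A real run would either sort the files by decreasing size first (to get close to the minimum number of bins) or take them in the given order (to keep files in sequence), which is exactly the choice mentioned above.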

In many cases, datapacker executes almost instantaneously. Of particular note, the hardlink action can be used to effectively copy data into bins without having to actually copy the data at all.
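
The hardlink action is nothing exotic: it is the ordinary hard-link call applied per bin, so each bin directory references the same on-disk data as the original files. A hypothetical sketch of the idea (not datapacker’s actual code; linkIntoBin is a made-up name):

import System.Directory (createDirectoryIfMissing)
import System.FilePath (takeFileName, (</>))
import System.Posix.Files (createLink)

-- Hypothetical sketch: "copy" files into a bin directory by hard-linking
-- them, so no file contents are actually duplicated on disk.
linkIntoBin :: FilePath -> [FilePath] -> IO ()
linkIntoBin binDir files = do
  createDirectoryIfMissing True binDir
  mapM_ (\f -> createLink f (binDir </> takeFileName f)) files

Hard links only work within a single filesystem, of course, so the bins have to live on the same filesystem as the source files for this to apply.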

datapacker is a tool in the traditional Unix style; it can be used in pipes and can call other tools.

I have, of course, uploaded it to sid. But while it sits in NEW, you can download the source tarball (with debian/ directory) from the project homepage at http://software.complete.org/datapacker. I’ve also got an HTML version of the manpage online, so you can see all the cool features of datapacker. It works nicely with find, xargs, mkisofs, and any other Unixy pipe-friendly program.

Those of you that know me will not be surprised that I wrote datapacker in Haskell. For this project, I added a bin-packing module and support for parsing inputs like 1.5g to MissingH. So everyone else that needs to do that sort of thing can now use library functions for it.
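
For the curious, parsing something like 1.5g comes down to reading a number, looking up a unit suffix, and multiplying. Here is a hypothetical sketch of that idea (the function in MissingH has a different name and interface; this sketch assumes binary units, i.e. g = 2^30):

import Data.Char (toLower)

-- Hypothetical sketch of parsing human-style sizes such as "1.5g" or
-- "700m" into a byte count, using binary (1024-based) multipliers.
parseSize :: String -> Maybe Integer
parseSize s =
  case reads s :: [(Double, String)] of
    [(n, suffix)] -> fmap (\m -> round (n * m)) (multiplier (map toLower suffix))
    _             -> Nothing
  where
    multiplier ""  = Just 1
    multiplier "k" = Just 1024
    multiplier "m" = Just (1024 ^ 2)
    multiplier "g" = Just (1024 ^ 3)
    multiplier "t" = Just (1024 ^ 4)
    multiplier _   = Nothing

With that, parseSize "1.5g" gives Just 1610612736, and anything with an unknown suffix comes back as Nothing.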

Update… I should have mentioned the really cool thing about this. After datapacker compiled and ran, I had only one mistake that was not caught by the Haskell compiler: I said < where I should have said <= in one place. This is one of the very nice things about Haskell: the language lends itself to compilers that can catch so much. It’s not that I’m a perfect programmer; it’s just that my compiler is pretty crafty.

Backup Software

I think most people reading my blog would agree that backups are extremely important. So much important data is on computers these days: family photos, emails, financial records. So I take backups seriously.

A little while back, I purchased two identical 400GB external hard disks. One is kept at home, and the other at a safe deposit box in a bank in a different town. Every week or two, I swap drives, so that neither one ever becomes too dated. This process is relatively inexpensive (safe deposit boxes big enough to hold the drive go for $25/year), and works well.

I have been using rdiff-backup to make these backups for several years now. (Since at least 2004, when I submitted a patch to make it record all metadata on MacOS X). rdiff-backup is quite nice. It is designed for storage to a hard disk. It stores on the disk a current filesystem mirror along with some metadata files that include permissions information. History is achieved by storing compressed rdiff (rsync) deltas going backwards in time. So restoring “most recent” files is a simple copy plus application of metadata, and restoring older files means reversing history. rdiff-backup does both automatically.
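
For those who haven’t used it, the reverse-delta layout is easy to picture: the newest tree is stored whole, and each increment describes how to step one version back. A toy model of that idea (my own sketch, nothing to do with rdiff-backup’s real formats):

import qualified Data.Map as M

-- Toy model of reverse-delta history (not rdiff-backup's formats): the
-- current state is stored whole, and each increment maps paths to what
-- they looked like one step earlier (Nothing = the file did not exist).
type FileTree  = M.Map FilePath String
type Increment = M.Map FilePath (Maybe String)

-- Apply one reverse delta, stepping the tree back one version.
stepBack :: FileTree -> Increment -> FileTree
stepBack = M.foldrWithKey undo
  where
    undo path (Just old) tree = M.insert path old tree
    undo path Nothing    tree = M.delete path tree

-- Restoring the newest state is just a copy of the mirror; restoring n
-- versions back means folding the newest n reverse deltas over it.
restoreBack :: Int -> FileTree -> [Increment] -> FileTree
restoreBack n mirror incs = foldl stepBack mirror (take n incs)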

This is a nice system and has served me well for quite some time. But it has its drawbacks. One is that you always have to have the current image, uncompressed, which uses up lots of space. Another is that you can’t encrypt these backups with something like gpg for storage on a potentially untrusted hosting service (say, rsync.net). Also, when your backup disk fills up, it takes forever to figure out what to delete, since rdiff-backup --list-increment-sizes must stat tens of thousands of files. So I went looking for alternatives.

The author of rdiff-backup actually wrote one, called duplicity. Duplicity works by, essentially, storing a tarball full backup with its rdiff signature, then storing tarballs of rdiff deltas going forward in time. The reason rdiff-backup must have the full mirror is that it must generate rdiff deltas “backwards”, which requires the full prior file available. Duplicity works around this.
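
The contrast is that duplicity’s chain runs forward: a full backup first, then deltas that each build on the state before them, so any restore replays the chain from the full backup onward. A toy sketch of that dependency (again my own illustration, not duplicity’s formats):

-- Toy model of forward-delta restore: every restore starts from the full
-- backup and replays deltas in order, so a lost or corrupted full backup
-- takes everything after it down too.
type State = [(FilePath, String)]   -- a flattened "filesystem"
type Delta = State -> State         -- one forward increment

restoreForward :: Maybe State -> [Delta] -> Maybe State
restoreForward full deltas = foldl (\st d -> fmap d st) full deltas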

However, the problem with duplicity is that if the full backup gets lost or corrupted, nothing newer than it can be restored. You must make new full backups periodically so that you can remove the old history. The other big problem with duplicity is that it doesn’t grok hard links at all. That makes it unsuitable for backing up /sbin, /bin, /usr, and my /home, in which I frequently use hard links for preparing CD images, linking DVCS branches, etc.

So I went off searching out other projects and thinking about the problem myself.

One potential solution is to simply store tarballs and rdiff deltas going forward. That would require performing an entire full backup every day, which probably isn’t a problem for me now, but I worry about the load that will place on my hard disks and the additional power it would consume to process all that data.

So what other projects are out there? Two caught my attention. The first is Box Backup. It is similar in concept to rdiff-backup. It has its own archive format, and otherwise operates on a similar principle to rdiff-backup. It stores the most recent data in its archive format, compressed, along with the signatures for it. Then it generates reverse deltas similar to rdiff-backup. It supports encryption out of the box, too. It sounded like a perfect solution. Then I realized it doesn’t store hard links, device entries, etc., and has a design flaw that causes it to miss some changes to config files in /etc on Gentoo. That’s a real bummer, because it sounded so nice otherwise. But I just can’t trust my system to a program where I have to be careful not to use certain OS features because they won’t be backed up right.

The other interesting one is dar, the Disk ARchive tool, described by its author as the great grandson of tar — and a pretty legitimate claim at that. Traditionally, if you are going to back up a Unix box, you have to choose between two not-quite-perfect options. You could use something like tar, which backs up all your permissions, special files, hard links, etc, but doesn’t support random access. So to extract just one file, tar will read through the 5GB before it in the archive. Or you could use zip, which doesn’t handle all the special stuff, but does support random access. Over the years, many backup systems have improved upon this in various ways. Bacula, for instance, is incredibly fast for tapes as it creates new tape “files” every so often and stores the precise tape location of each file in its database.

But none seem quite as nice as dar for disk backups. In addition to supporting all the special stuff out there, dar sports built-in compression and encryption. Unlike tar, compression is applied per-file, and encryption is applied per 10K block, which is really slick. This allows you to extract one file without having to decrypt and decompress the entire archive. dar also maintains a catalog which permits random access, has built-in support for splitting archives across removable media like CD-Rs, has a nice incremental backup feature, and sports a host of tools for tweaking archives — removing files from them, changing compression schemes, etc.

But dar does not use binary deltas. I thought this would be quite space-inefficient, so I decided I would put it to the test, against a real-world scenario that would probably be pretty much a worst case scenario for it and a best case for rdiff-backup.

I track Debian sid and haven’t updated my home box in quite some time. I have over 1GB of .debs downloaded which represent updates. Many of these updates are going to touch tons of files in /usr, though often making small changes, or even none at all. Sounds like rdiff-backup heaven, right?

I ran rdiff-backup to a clean area before applying any updates, and used dar to create a full backup file of the same data. Then I ran apt-get upgrade, and made incrementals with both rdiff-backup and dar. Finally I ran apt-get dist-upgrade, and did the same thing. So I have three backups with each system.

Let’s look at how rdiff-backup did first.

According to rdiff-backup --list-increment-sizes, my /usr backup looks like this:

        Time                       Size        Cumulative size
-----------------------------------------------------------------------------
Sun Apr 13 18:37:56 2008         5.15 GB           5.15 GB   (current mirror)
Sun Apr 13 08:51:30 2008          405 MB           5.54 GB
Sun Apr 13 03:08:07 2008          471 MB           6.00 GB

So what we see here is that we’re using 5.15GB for the mirror of the current state of /usr. The delta between the old state of /usr and the state after apt-get upgrade was 471MB, and the delta representing dist-upgrade was 405MB, for total disk consumption of 6GB.

But if I run du -s over the /usr storage area used by rdiff-backup, it says that 7.0GB was used. du -s --apparent-size shows 6.1GB. The difference is that all those tens of thousands of files each waste some space at the end of their last block, and that adds up to an entire gigabyte. rdiff-backup effectively consumed 7.0GB of space.

Now, for dar:

-rw-r--r-- 1 root root 2.3G Apr 12 22:47 usr-l00.1.dar
-rw-r--r-- 1 root root 826M Apr 13 11:34 usr-l01.1.dar
-rw-r--r-- 1 root root 411M Apr 13 19:05 usr-l02.1.dar

This was using bzip2 compression, and backed up the exact same files and data that rdiff-backup did. The initial mirror was 2.3GB, much smaller than the 5.1GB that rdiff-backup consumes. The apt-get upgrade differential was 826MB compared to the 471MB in rdiff-backup — not really a surprise. But the dist-upgrade differential — still a pathologically bad case for dar, but less so — was only 6MB larger than the 405MB rdiff-backup case. And the total actual disk consumption of dar was only 3.5GB — half the 7.0GB rdiff-backup claimed!

I still expect that, over an extended time, rdiff-backup could chip away at dar’s lead… or maybe not, if lots of small files change.

But this was a completely unexpected result. I am definitely going to give dar a closer look.

Also, before I started all this, I converted my external hard disk from ext3 to XFS because of ext3’s terrible performance with rdiff-backup.

Pennsylvania and Irrelevance

NPR has been doing an interesting series this week. They’ve sent out a reporter who is going all across Pennsylvania interviewing people at local food markets. He found a fish shop in Pittsburgh, a market in Lancaster, and some shops in Philadelphia. He sought out Democratic voters to ask them about their thoughts on Clinton vs. Obama.

A lot of the Pennsylvania voters were for Clinton. When asked why, most of them said that they liked Bill Clinton and his policies. A few said they liked how Hillary handled the Lewinsky affair. To me, none of that has anything to do with whether Clinton or Obama would be better for the country.

Then there was the person this morning who was criticizing Obama for not offering specifics. She said she is Jewish, and so Israel is important to her, and Obama hasn’t said anything about helping along the peace process. So I went to barackobama.com, clicked Enter the Site, went to Issues, Foreign Policy, then Israel. Then I clicked on the full fact sheet, which was a full 2 pages on Israel, including far more detail than the voter said she wanted.

I often wonder about these people who say Obama doesn’t have specifics. Just because each speech doesn’t read off a whole lot of detail doesn’t mean that he doesn’t have it — it’s all there on the website. I’m sure people who don’t have Internet access could call the Obama campaign and get the information, too. It seems Obama ought to do a better job of mentioning this at every possible opportunity.

Then I hear a lot of Clinton supporters saying that since Clinton has won states like Ohio in the primaries, she’d do better there in the general election. I think that is a totally specious argument. Just because Clinton did better with Democrats doesn’t mean that she’d do better in the general election. We can generally assume that Democratic voters will vote for the Democratic nominee, whoever that is. The question is how many independents and Republicans a candidate can win over.