February 23rd, 2007
Today I was musing about different version control systems and merge algorithms. I’ve been thinking specifically about how I maintain Debian packages in Darcs. I tend to import upstream tarballs into one branch, and maintain the Debian packages in another, simply merging when a new upstream is released.
Now, there seem to be two prevailing philosophies on how to handle merges in this case. I’m thinking here about merges back to upstream. Say I want to contribute my Debian patches to them.
- Commit “clean” patches upstream. Leave out the incremental history (the typo-fix commits, the bug-fix commits, the merges that track new upstream releases) and submit something like a series of diffs against the current head.
- Bring across the full history, warts and all, and keep it around permanently.
git encourages option #1 with its rebase command. Darcs encourages option #2 (though some use its amend-record command to work more like #1).
As I got to thinking about it, it occurred to me that git-rebase would be very nice if you are going to use philosophy #1. In short, rebase removes your local patches from a repo, updates it to the latest upstream, then re-applies your local changesets, stopping so that you can fix any conflicts. This is as opposed to a more traditional merge, where you add the upstream changesets to your local branch and then commit new changesets to resolve conflicts. (So a rebase would be totally useless in situation #2.)
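To make the difference concrete, here is a toy sketch in Python. It is not how git or Darcs work internally: a branch is just a list of changeset names, and the names (u1, debian-1 and so on) and both functions are made up for illustration.

```python
# Toy model: a branch is a list of changeset names, oldest first.
# Everything here is hypothetical; it only mirrors the description above.

def merge(local, upstream, base_len):
    """Traditional merge: keep local history as-is, add the new upstream
    changesets, then record a changeset that resolves any conflicts."""
    new_upstream = upstream[base_len:]
    return local + new_upstream + ["merge upstream"]

def rebase(local, upstream, base_len):
    """Rebase: lift the local changesets off, move to the upstream head,
    then re-apply them. The re-applied changesets are new ones (marked ')."""
    mine = local[base_len:]
    return upstream + [name + "'" for name in mine]

base = ["u1", "u2"]                       # history both sides share
local = base + ["debian-1", "debian-2"]   # my packaging changesets
upstream = base + ["u3"]                  # a new upstream release

merged = merge(local, upstream, len(base))
rebased = rebase(local, upstream, len(base))

print(merged)
print(rebased)
```

In the merged list my original changesets are still there, in the order I made them. In the rebased list they are gone and replaced by rewritten copies sitting on top of upstream.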
I got to thinking about this, and started wondering what would happen to the people I work with who in turn work off my branches. And sure enough, the git-rebase manpage says, “When you rebase a branch, you are changing its history in a way that will cause problems for anyone who already has a copy of the branch in their repository and tries to pull updates from you.”
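Here is a rough sketch of why that warning holds. It assumes only that a changeset's ID is derived from its parent's ID plus its own content, which is true of git commits in general; the real commit format has more fields, and the helper names below are mine.

```python
import hashlib

def commit_id(parent_id, change):
    # Simplified stand-in for a commit hash: it covers the parent ID and
    # the change itself, so a different parent gives a different ID.
    return hashlib.sha1((parent_id + "\n" + change).encode()).hexdigest()

def build(changes):
    """Return the chain of IDs for a linear history, oldest first."""
    ids = []
    parent = ""
    for change in changes:
        parent = commit_id(parent, change)
        ids.append(parent)
    return ids

# What I published, and what a collaborator already pulled.
published = build(["u1", "debian-1", "debian-2"])

# The same two Debian changes after rebasing onto a new upstream u2.
rebased = build(["u1", "u2", "debian-1", "debian-2"])

# The shared starting point still matches...
print(published[0] == rebased[0])
# ...but the collaborator's head no longer exists in my history.
print(published[-1] in rebased)
```

The changes themselves are identical, yet every changeset after the fork point has a new identity, so anyone who based work on the old ones is now building on history I no longer have.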
I maintain, therefore, that git-rebase is evil and should be avoided. It only works for a situation where someone maintains a private branch of a project, never shared in any way except to submit patches to an upstream. Forget it if you have a team maintaining that branch, or want to post that branch online for others to help with (as I do with my Debian darcs package). Even if you keep it private now, do you really want to adopt a work process that forces you to keep it private forever, or else completely change how you work?
And this brings me back to the original question of patch philosophy. Personally, I dislike philosophy #1. I’d much rather have the full history of a change, warts and all. Look at the Linux kernel example: changesets that introduced bugs that made it into the official tree have their fixes documented, but changesets that introduced bugs that were fixed before being merged into the official tree could be lost to the public due to rebasing by submitters. Is that really what we want? I don’t think so.
With Darcs, tagging is very cheap and it is quite trivial to write an “apply a changeset bundle” script that makes a before tag, applies a series of patches, and makes an after tag. One could then run a darcs diff between the two tags to see the net effect on the repository, or could still look at the individual patches. (Or, you can avoid tagging and manually specify the “from” and “to” patches.) I find that a much better model: you can have it both ways. I’d think that most modern VCSs ought to support some variant on that, too.
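As a rough illustration of that “have it both ways” idea, here is a toy model in Python. It does not call Darcs at all: the repository is a plain list of patches, a tag is just a position in that list, and every name in it (apply_bundle, net_diff, the file names) is hypothetical. It also ignores file deletions to stay short.

```python
# Toy repository: an ordered list of (patch name, {path: new content})
# plus a table of tags. A tag here is just "how many patches existed".

def apply_bundle(repo, bundle, label):
    """Tag, apply every patch in the bundle, then tag again."""
    repo["tags"][label + "-before"] = len(repo["patches"])
    repo["patches"].extend(bundle)
    repo["tags"][label + "-after"] = len(repo["patches"])

def state_at(repo, tag):
    """Replay patches up to the tag to get the file contents at that point."""
    files = {}
    for _name, changes in repo["patches"][:repo["tags"][tag]]:
        files.update(changes)
    return files

def net_diff(repo, from_tag, to_tag):
    """Net effect between two tags as {path: (old content, new content)}."""
    before = state_at(repo, from_tag)
    after = state_at(repo, to_tag)
    return {path: (before.get(path), after[path])
            for path in after if before.get(path) != after[path]}

repo = {"patches": [("import upstream", {"README": "hello"})], "tags": {}}
bundle = [
    ("add manpage", {"foo.1": "foo - does thigns"}),
    ("fix typo in manpage", {"foo.1": "foo - does things"}),
]
apply_bundle(repo, bundle, "contrib")

diff = net_diff(repo, "contrib-before", "contrib-after")
print(diff)                                  # the clean, net view
print([name for name, _ in repo["patches"]]) # the full history, warts and all
```

The net diff shows only the finished manpage, while the patch list still records that a typo was made and then fixed.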
And I think that git-rebase should be removed on the grounds that it encourages poor version tracking practices.