Mirroring SVN repository to GitHub

20 August 2008

git java jruby subversion

So, I'm gearing up to work on some Java+Ruby (via JRuby) stuff. The Java world still seems fairly entrenched in the cult of Subversion, while the Rubyists have gone with Git lately.

I'm still wrapping my mind around Git, but with GitHub, it's fairly easy and straightforward. I paid my $7 for the micro account, to give me room to screw around.

There are quite a few posts about mirroring SVN to a Git repository, but I feel the need to add my own, of course.

My goal is to mirror the trunk of the JRuby project from Codehaus SVN to my account on GitHub. By doing this, I can track the trunk development, and also work on my own patches.

I started by creating an empty repository on my GitHub account, called 'jruby'.


Now, over on my always-on, Contegix-powered server, I create a brand new local git repository, also called jruby.

mkdir jruby
cd jruby
git init

Next I use 'git svn init' to set up the SVN repository as a remote code source to track. Using the -T switch points git to the trunk, and ignores branches and tags, which is fine for my purposes.

git svn init -T http://svn.codehaus.org/jruby/trunk/jruby/

That does not pull any code, but it lets my local working tree know that I'm going to be pulling from an SVN repository at some point. This setup only occurs in your local repository, and does not seem to ever get pushed to GitHub once we get to that point.

So, now we do the initial pull. Once again, this is on my always-on, Contegix-powered server, not my local laptop. I'm doing this on a server because towards the end, we'll be setting up a cronjob to accomplish it all.

git svn fetch

It'll think for a while, it'll slurp down the SVN revision history, it'll stop and ponder occasionally, and eventually, it'll be done. Woo-hoo! Our local working tree is now up-to-date with the subversion HEAD as of that moment.

To reduce the disk space used by your local repository, go ahead and run the garbage collector:

git gc

On my system, that reduced the space from over 600 MB to under 70 MB.

Now, that's great, but it's still just on my local repository. Time to push it to GitHub. We're not going to follow their directions exactly, since this will ultimately be a cronjob and needs to use ssh. And I'm slightly paranoid about my ssh keys.

So, the first thing I do is create another keypair, used only by my mirroring process, and only for pushing changes to GitHub. It has no passphrase. This allows me to keep my top-secret keys off my shared, always-on server. If these keys are compromised, all an attacker can use them for is to push changes to GitHub. Which, being revision-control, is more annoying than dangerous. (Hooray for "git reset").

ssh-keygen -t dsa -f .ssh/id_dsa_github_mirroring

Next, I edit my .ssh/config to add a "fake host" so that ssh connections invoked by git will use this new key.

As with all previous bits, this is still on my always-on server, not my local laptop.

Host githubmirror
  User git
  Hostname github.com
  IdentityFile /home/bob/.ssh/id_dsa_github_mirroring

This will turn any invocation of "ssh githubmirror" into "ssh git@github.com -i .ssh/id_dsa_github_mirroring".

I then installed id_dsa_github_mirroring.pub into my GitHub account.

Now, GitHub's instructions say to run this command to add the GitHub repository as a remote named "origin":

git remote add origin git@github.com:bobmcwhirter/jruby.git

Instead, we tweak it to use the "fake host" we added to .ssh/config:

git remote add origin git@githubmirror:bobmcwhirter/jruby.git

We're almost done, I promise.

Next, we need to do the first push from my server up to GitHub. We first push to the 'master' branch, since the repo really wants to have a master branch.

git push origin master

Now, GitHub doesn't allow you to fork a repository you own, and since this mirror is owned by me, where can I do my own hacks and patches? The 'master' branch of course. But I still want an unmolested, straight-from-subversion mirror. So, I create a 'vendor' branch in my workspace. It's initialized to match 'master' exactly.

git checkout -b vendor

Now, I push that to GitHub, too.

git push origin vendor

Awesome. I now have two branches, identical at the moment, called "vendor" and "master".

Now, as far as I can tell, all the Subversion setup that we did only lives in the local repository on my always-on server. Anyone who clones from the GitHub repository will not have that stuff. They can of course do a 'git svn init' themselves, to add it to their local repository. But it doesn't flow through GitHub.

But that's fine, since I've been doing this on my always-on server anyhow. My workspace is sitting in the 'vendor' branch that's tracking the vendor branch from GitHub.

I can pull the latest changes from Subversion by typing:

git svn rebase

The 'rebase' command is neat, in that any changes that exist in the git repository are floated to be applied to whatever the latest HEAD is. But since I'm only concerned with a one-way SVN-to-Git mirror, there will never be any changes to float, and this will just tack on subsequent SVN commits as Git commits onto the 'vendor' branch. It'll leave the 'master' branch un-touched.

After rebasing, you gotta push the 'vendor' branch up to GitHub.

git push origin vendor

Now, type that every 15 minutes, and your 'vendor' branch will stay mostly up-to-date.

Or use cron.

I've cronned a script that fires every 15 minutes:

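#!/bin/sh
# Mirror one SVN-backed repository; $1 is the repository directory name.
cd "/home/bob/github-svn-mirrors/$1" || exit 1
# Pull new SVN commits onto the checked-out 'vendor' branch, then push only if that worked.
git svn rebase && git push origin vendor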

It's run with the repository name as the first (and only) argument:

*/15 * * * * /home/bob/github-svn-mirrors/bin/mirror jruby

Now, over on my laptop, finally, I can clone the repository, work on topic branches, push to master and have my own controlled environment and fork, while knowing the 'vendor' branch reflects the pure SVN state which I can also pull into my hackings as-desired.

When I submit a patch, if it ultimately floats back to me through the vendor branch, git is supposedly smart enough to realize that the same changes have arrived in my 'master' (assuming it's applied verbatim) and keep things nice and tidy. Else, I can force a merge, trampling my half-assed patch with the official JRuby code.

Unwind with Subversion

16 October 2007

codehaus jbossorg opensource ruby subversion tools

At the Codehaus and at JBoss.org, I've continually come across Subversion repositories that needed to be split apart or merged, perhaps after converting from CVS. One problem you continually hit, particularly if you're merging repositories, is the "date order of revisions" bug. Simply stated, if you create a new repository loaded from two other repositories, you can end up with a situation where revision N does not necessarily occur before revision N+1, in terms of the commit time-stamp.

When you do a date-based operation using Subversion, it does a binary search through the revision sequence to find the revisions matching the specified dates. This binary search assumes the revisions are indeed date-ordered.
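To make the failure concrete, here's a minimal Ruby sketch of that kind of lookup. It's a simplified model, not Subversion's actual implementation: the method name revision_at, the use of array indexes as revision numbers, and the sample dates are all made up for illustration.

```ruby
# Simplified model of a date lookup: return the highest revision whose
# commit date is on or before `target`, found by binary search.
# Assumption: dates[n] is the commit date of revision n, written as an
# ISO date string so that plain string comparison orders it correctly.
def revision_at(dates, target)
  lo = 0
  hi = dates.length - 1
  found = nil
  while lo <= hi
    mid = (lo + hi) / 2
    if dates[mid] <= target
      found = mid      # candidate; keep looking for a later one
      lo = mid + 1
    else
      hi = mid - 1
    end
  end
  found
end

# A healthy repository: dates rise with revision numbers.
ordered = %w[2007-01-01 2007-02-01 2007-03-01 2007-04-01 2007-05-01]
# A naively merged repository: two histories loaded back to back.
merged  = %w[2007-01-01 2007-04-01 2007-02-01 2007-05-01 2007-03-01]

puts revision_at(ordered, "2007-03-15")  # prints 2
puts revision_at(merged,  "2007-03-15")  # prints 2
```

In the merged case, revision 4 (dated 2007-03-01) is also older than the target date, but the search discards that half of the range and never looks at it.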

With the acquisition of Mobicents by JBoss, we're in the situation of having to merge about a dozen repositories. Some are CVS, some are SVN. Good ol' cvs2svn works well for the first step, of converting a CVS repository into an SVN repository. But now we have either oddly disjoint repositories, or conflicting paths overlaid one another.

I've always been a fan of mod_rewrite for Apache-httpd, and an SVN dumpfile has a lot of paths just ripe for rewriting. 1000 lines of Ruby code later, I'm able to announce Unwind. Unwind is a Ruby library, along with a command-line tool, for performing mind-numbing feats of repository surgery.

Since a massive conversion and rewriting is something that requires a bit of trial-and-error, the command-line utility is ultimately driven by a configuration file. Of course, with Ruby, it's just a DSL created using instance_eval and blocks.
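For anyone who hasn't seen the instance_eval trick, here's a rough sketch of how a block-based configuration DSL can be put together. This is not Unwind's actual syntax: the class names MergeConfig and SourceConfig, the methods output, source, include, exclude and rewrite, and the dumpfile name are all hypothetical, chosen only to show the mechanism.

```ruby
# Hypothetical per-source settings, filled in by a nested block.
class SourceConfig
  attr_reader :dumpfile, :includes, :excludes, :rewrites

  def initialize(dumpfile)
    @dumpfile = dumpfile
    @includes = []
    @excludes = []
    @rewrites = {}
  end

  def include(path)
    @includes << path
  end

  def exclude(path)
    @excludes << path
  end

  def rewrite(from, to)
    @rewrites[from] = to
  end
end

# Hypothetical top-level configuration object.
class MergeConfig
  attr_reader :output_file, :sources

  def initialize
    @sources = []
  end

  # Evaluate the block with `self` set to a fresh config object, so bare
  # calls such as `output "..."` inside the block land on that object.
  def self.load(&block)
    config = new
    config.instance_eval(&block)
    config
  end

  def output(file)
    @output_file = file
  end

  def source(dumpfile, &block)
    src = SourceConfig.new(dumpfile)
    src.instance_eval(&block) if block
    @sources << src
  end
end

config = MergeConfig.load do
  output "merged-repo.svndump"
  source "first.svndump" do
    include "/trunk"
    exclude "/CVSROOT"
  end
end

puts config.sources.first.includes.inspect
```

The appeal of this style is that the configuration file is ordinary Ruby, so no parser is needed; the cost is that the file can run arbitrary code.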

[Image: Picture 5.png, an example Unwind configuration file]

This configuration file will ultimately produce a single file (merged-repo.svndump) from multiple input dump files. Each source file can include()/exclude() paths (based upon the original paths in that particular dumpfile). Each source can also use Rails-ish URL rewriting. The :something syntax matches 1 segment of a path, and is available as a substitution value in the output path for the rule.
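The :something matching can be sketched in a few lines of Ruby. This is only an illustration of the idea, not Unwind's code: the method names compile_rule and rewrite, the sample paths, and the choice to carry any trailing remainder of the path across unchanged are all assumptions.

```ruby
# Turn a pattern such as "/:project/trunk" into a regex plus the list of
# segment names, where each :name matches exactly one path segment.
def compile_rule(pattern)
  names = []
  source = pattern.split("/", -1).map do |segment|
    if segment.start_with?(":")
      names << segment[1..-1]
      "([^/]+)"
    else
      Regexp.escape(segment)
    end
  end.join("/")
  # Assumption: anything below the matched prefix is kept as-is.
  [Regexp.new("\\A#{source}(/.*)?\\z"), names]
end

# Rewrite `path` using the rule from -> to; returns nil when the rule
# does not apply to this path.
def rewrite(path, from, to)
  regex, names = compile_rule(from)
  match = regex.match(path)
  return nil unless match

  target = to.gsub(/:(\w+)/) do
    name = Regexp.last_match(1)
    index = names.index(name) or raise ArgumentError, "unknown segment :#{name}"
    match[index + 1]
  end
  target + match[names.length + 1].to_s
end

puts rewrite("/mobicents/trunk/src/Main.java", "/:project/trunk", "/trunk/:project")
# prints /trunk/mobicents/src/Main.java
```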

The rewrite engine tracks all existing paths, and creates parent directories when necessary. SVN copy operations are fully adjusted both for the source and the destination paths.
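Here's a small sketch of the parent-directory bookkeeping, assuming the set of known paths is simply held in memory. The method name missing_parents and the sample paths are hypothetical, and Unwind's real bookkeeping may well look different.

```ruby
require "set"

# Record `path` as existing and return the parent directories that had
# to be created first, outermost first. `known` is a Set of paths that
# already exist in the output repository.
def missing_parents(known, path)
  parts = path.split("/").reject(&:empty?)
  created = []
  parts[0...-1].each_with_index do |_, i|
    dir = "/" + parts[0..i].join("/")
    # Set#add? returns nil when the directory was already known.
    created << dir if known.add?(dir)
  end
  known.add("/" + parts.join("/"))
  created
end

known = Set.new(["/trunk"])
puts missing_parents(known, "/trunk/mobicents/src/Main.java").inspect
# prints ["/trunk/mobicents", "/trunk/mobicents/src"]
```

A rewritten path can land somewhere no earlier revision ever created, so the missing directories have to be added before the file itself.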

Unwind automatically interleaves revisions to achieve total monotonically increasing time-ordering for the final repository.
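One simple way to get that ordering is a stable sort by commit date across all sources, followed by renumbering. The sketch below shows that approach; it is an assumption about the general technique, not a description of Unwind's internals, and the Revision struct, the interleave method and the sample data are invented for the example.

```ruby
# Hypothetical record for one revision read from a source dumpfile.
# Dates are ISO strings, which sort correctly as plain strings.
Revision = Struct.new(:source, :number, :date)

# Merge revisions from several sources into one date-ordered sequence
# and assign new revision numbers starting at 1. Sorting on [date, index]
# keeps the original relative order when two dates are equal.
def interleave(*sources)
  ordered = sources.flatten(1).sort_by.with_index { |rev, i| [rev.date, i] }
  ordered.each_with_index.map { |rev, i| [i + 1, rev] }
end

a = [Revision.new("a", 1, "2007-01-01"), Revision.new("a", 2, "2007-03-01")]
b = [Revision.new("b", 1, "2007-02-01"), Revision.new("b", 2, "2007-04-01")]

interleave(a, b).each do |new_number, rev|
  puts "r#{new_number} <- #{rev.source} r#{rev.number} (#{rev.date})"
end
```

Because the revision numbers change, anything that refers to a revision by number, such as the source revision of an SVN copy, has to be remapped through the same table.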

Finally, before a revision is emitted to the output repository, additional include()/exclude() rules can be applied. For repositories converted from CVS, you may end up with a bundle of CVSROOT directories attempting to live in the same location. No reason to rewrite them to unique locations, as you can just exclude them before they get figured into the final output repository.

Unwind uses SQLite for organizing the meta-information about each repository and revision, while performing random-access seeks on the source dumpfiles to produce the final repository. While merging may be the common use-case, Unwind's rewriting also makes it useful just for extracting bits out of a repository.

At this point, this blog entry is the complete documentation for Unwind. But feel free to browse the SVN repository.