Saturday, May 16, 2009

Github's Fork Queue

I keep my Perl code in Git, and I've been using Github to host it. Github significantly lowers the entry barrier towards contributing to opensource projects. I really love it so far. But I do have one problem, the only way to apply commits using its fork queue feature is actually the equivalent of running git cherry-pick --signoff.

When you use git cherry-pick it effectively clones the commit by reapplying the diff, and overrides some of the commit metadata. The commit you cherry-picking and the commit you end up applying are become two separate history tracks as far as Git is concerned, even if there could be a fast forward merge.

When your contributer tries to sync up again, they will probably run git pull (I'll post some other time on why I think that's a bad habit), and end up merging their version of the patch with your version of the patch.

This can lead to very confusing history, especially when they want you to pull again and you have a mishmash of patches to sort through. Making things worse, many people who contribute using Github do so because it's so easy. By cherry picking their commits instead of just merging them you are making it hard for them, they will probably have to run git reset --hard origin/master to make things right.

So, in the interest of sanity, unless you actually mean to cherry pick, please always use git merge:


% git remote add some_user git://github.com/some_user/repo.git
% git remote update
Updating origin
Updating some_user
From git://github.com/some_user/repo
 * [new branch]      master   -> some_user/master
% gitx HEAD some_user/master
% git merge some_user/master
Updating decafbad..b00b1es
Fast forward
 lib/Some/File.pm |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

The gitx or gitk invocation lets you compare the two branches, viewing the commits you are about to merge before you actually apply them. I also really like using gitx --all to compare all the branches of a tree. The --left-right option is useful for comparing divergent branches. Note that all of these options are also usable with plain git log.

If you do insist on cherry picking, make sure to tell your contributer what to do afterwords (e.g. git reset --hard origin/master, possibly preceded with git branch rejected_work to keep their rejected patches in their own branch). Otherwise their unapplied work will continue to haunt them as they try to work on the new version.

Hopefully Github will implement a feature in the fork queue allowing you to fast forward to a revision, making it easy to do handle this much more common case of merging.

Lastly, since I can't resist nitpicking I'd also like to add that by convention Signed-off-by means (at least in the context of git and linux) that the committer is making some sort of guarantee about the copyright status of the contribution, either giving rights if they're the author, or claiming that they received copy rights from the contributer whose patch they are applying. If you to interpret the Signed-off-by that way, then the fork queue is essentially forging a signoff from you every time you use it. It could be argued that it's meaningless, but then why do it at all? It seems the intent was really more like an Acked-by line. I think that adding a signoff should only be done if a checkbox that is disabled by default has been ticked by the user.

8 comments:

dustin said...

There's nothing wrong with a cherry-pick, just make sure you rebase on pull instead of merge. This is configurable per branch and you can setup git to automatically enable this setting when branching (see autosetuprebase).

In memcached, our master is *only* updated via cherry-pick now as it takes two people to make a change: one to make the change, and another to commit it. You see the committer and author in the changelog.

nothingmuch said...

I stand corrected: http://gist.github.com/112559 (didn't really believe that'd work ;-)

But this is still problematic because git newbies often don't know they should use pull --rebase or make that the default. I'd like to think I know git quite well, but I wasn't aware that rebase would do the right thing there.

This is the result I tend to see in the wild unfortunately: http://gist.github.com/112560

Thanks for the tip!

nothingmuch said...

Woot, this also works if you reorder commits. It's like darcs all over again =)

nothingmuch said...

Ooh it gets better, rebase mops up after an accidental merge: http://gist.github.com/112566

I love git rebase long time

melo said...

This works because git tracks content, not changes. When you cherry pick, the final content is the same, no matter what path took you there.

But I think the part that is wrong in all of your article is right there at the beginning. The contributor should not use his master branch to code. The first thing he should do is to create a topic branch and work on that.

When he is ready to show his code, he might want to rebase it against the latest master, just to make sure its still compatible with the latest version of the upstream code.

If his topic branch is accepted, it will show up on his master branch, naturally, and then he can just kill off his topic branch.

nothingmuch said...

Actually in this case I think it's the inverse, and that's why I was so surprised. The rebase is also comparing the deltas of the commits (which is why reordering works), not just the resulting commit sha1s or tree sha1s, in order to deduce which commits in the log are in fact meaningless duplicates, allowing out of order cherry-picking work too. If it only compared commit and tree sha1s then it wouldn't be so awesome.

Secondly, I'd like to stress that the problem is not with the software, it's with the people who don't know how to work "correctly" with git. Most of the patches I receive have been done on master, and if I were to cherry pick them the contributer would probably not know to use pull --rebase or to set autosetuprebase in their configuration unless I told them to.

I agree, topic branches are the way to go, but you're preaching to the choir. Github's fork queue is appealing to people who are not comfortable using git on the command line. Unfortunately the defaults it uses are harmful to exactly that demographic. These people don't understand topic branches or git's inherent immutability in the model or that history can be nonlinear, so expecting them to understand the implications of cherry picking and rebasing is a little naive.

melo said...

I don't think that git looks that much attention to the sequence of commits. The only check for sha1's that he does (AFAIK) is to determine if he can fast-forward or not. If not, then he just uses a straight three-way merge between the two trees.

As long as commits A->B->C result in the same tree as C->A->B (or even C+B->A, git doesn't care about the sequence. it merges the content, not the commit sequence.

Thats why its called a content tracker. He worries about end results, not how you got there.

And yes, I understood that this is about best practices and workflow, not about git limitations.

And yes, I can see now that the current fork-queue is not that good for non-power users. I think they need a rebase my master button :).

Jason said...

> And yes, I can see now that the current fork-queue is not that good for non-power users. I think they need a rebase my master button :).

Yeah, a rebase my master button would indeed be awesome.