Friday, June 26, 2009

App::Prove::State rocks

Running a standard Perl module's test suite is very easy using prove. If the module needs compilation, just run make beforehand and pass the -b flag to prove; otherwise use -l to add lib to @INC.

If you pass --state=save then information about the last run will be saved in a .prove file in the current directory. This data can be used later to run only failed tests, or to reorder tests.

My testing workflow was greatly improved using this feature. I use two shell aliases to run prove:

alias ts="prove --state=slow,save -j3 -r"
alias tf="prove --state=failed,save -v -r"

I start by running ts -l t (or ts -b t if the module requires make to run before testing). The -r flag tells it to search for .t files recursively in t. The -j option enables 3 concurrent jobs, and the --state flag orders the tests from slowest to fastest and causes the .prove file to be updated at the end of the run.

When I get failing tests I switch to using tf -l or tf -b. Since this has --state=failed it only runs tests that failed in the previous run. The -v causes the TAP output to be printed (with coloring).

I fix my tests or code one by one, and keep running tf. When tf has nothing left to run everything is passing, so I switch back to ts to run the full test suite again.

These two simple aliases let me run the test suite much more easily and effectively than make test, so a great many thanks to Ovid and Andy Armstrong [Update: and Andy Lester] for writing and maintaining these tools.

Wednesday, June 24, 2009

YAPC::NA

So YAPC::NA is over, or at least the conference proper is (there are still a few days of hacking to look forward to). It was a pretty bad YAPC for me: I was sick the whole time, didn't really see any talks, and barely socialized. I do feel better now, so hopefully I can still salvage some fun.

I'm also a little disappointed in my own talk. I was hoping to be a bit more practical, but every time I tried to write slides with that in mind I got lost in endless digressions. In the end I think it was a little too shallow and ranty, not exactly what I had hoped for. It did seem to be rather well received, so at least I'm happy about that.

I also don't think I presented well, but given that I was feeling very bad I guess I can't complain. By the end I was having tunnel vision and my ears felt really pressurized. Someone said I sounded like I was bored with my talk, so I guess it didn't go as badly as I felt ;-)

Anyway, here's hoping for a productive hackathon, and a quick recovery.

Monday, June 22, 2009

Persisting Closures

A few weeks ago Bruno Vecchi submitted failing tests that serialize code references to KiokuDB. I finally got around to trying to make these tests pass on my flight to YAPC|10. Long story short, KiokuDB now serializes closures, including managing shared closure variables between several code references.

Serialization is relatively trivial. If you're not familiar with PadWalker, it's a module that lets you introspect the lexical pads of subroutines, including extracting references to all the closed over variables using the closed_over function.

The code itself is deparsed to text using B::Deparse, and the hash returned by closed_over is serialized alongside it as normal, making sure that shared references are handled consistently. Any closure data that can be handled by KiokuDB will work normally.
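Conceptually the serialization step boils down to something like this (a simplified sketch, not the actual KiokuDB code):

use B::Deparse;
use PadWalker qw(closed_over);

sub serialize_closure {
    my $code = shift;

    # the closed over variables, keyed by name (e.g. "$foo", "@bar")
    my $pad = closed_over($code);

    # the body of the sub as Perl source text
    my $body = B::Deparse->new->coderef2text($code);

    # both parts are then serialized together; shared references in
    # $pad are handled by the normal object graph serialization
    return { body => $body, pad => $pad };
}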

Deserialization was slightly trickier: PadWalker needed additional functionality to actually assign to a subroutine's pad.

First KiokuDB compiles the body of the code while creating an empty lexical environment:

# $body holds the deparsed source text and $pad the hash of closed
# over variables, both loaded from storage
my $vars = join(", ", keys %$pad);

my $code = eval qq{
    my ( $vars );
    sub $body;
};

Then the new set_closed_over function added to PadWalker goes through the pad and aliases any values from the provided hash into the lexical scope of this subroutine.

This function goes through the CV's PADLIST field, and simply aliases the referents into the appropriate slots in the pad array.
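With the freshly compiled closure in hand, the deserialization side then boils down to a single call (a sketch, assuming the set_closed_over interface described above):

use PadWalker qw(set_closed_over);

# $code is the freshly compiled closure from the eval above, and
# $pad is the deserialized hash of closed over variables; after this
# call the pad entries in $code are aliased to the values in $pad
set_closed_over($code, $pad);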

On the one hand it's very reassuring that when you want to do this sort of thing in Perl you almost always can. What's slightly disappointing is that you do have to drop down to C for this sort of stuff.

My first version used the B module to introspect the structures, but this API is simply not powerful enough to allow modification of the data with aliasing semantics.

Thankfully the incredible amount of work in recent years on Perl 5 seems to be inspiring some new directions in Perl 5's core, and while the C API is pretty cumbersome, it does let you do almost anything you want if you want it hard enough.

Tuesday, June 16, 2009

Users, Accounts, Identities and Roles

In my previous post on modeling identity I showed how to decouple identities from accounts. Here is a more in-depth look at how and when to separate all the pieces of a user's data from one another.

What is a user?

Applications typically have an object that models the user, but strictly speaking this is not really the user. The user is the human operating the user agent that interacts with the application.

The object approximating the human is a number of different things:

  • An identifier the human uses to authenticate
  • A collection of the user's data
  • An entity that acts on behalf of the user when interacting with the system

What's important to realize is that these things are actually separate parts of what the concept of a user means to the system. Even if you roll all of them into a single class you should be able to make the distinction, and be able to break them out into separate objects should the need arise.

Whenever the representation of users in your database is too limiting, it's usually a direct consequence of the different aspects of the users' representation being lumped into a single user entity.

Refined concepts

Having a language with which to talk about a problem space is a vital tool for understanding and solving a problem. Here are some terms for this domain:

  • User: the human using the application in meatspace
  • User agent: the software through which the user interacts with the application
  • Identity: a verifiable identifier for the user (e.g. a username)
  • Account: concrete data owned by the user (e.g. a user profile or user authored data)
  • Actor: an entity with a goal in the system; something that performs operations on the data
  • Role: a set of behaviors assumed by an actor

There is significant overlap in these terms:

For the duration of a session, the user agent is synonymous with the user. Humans aren't made of software, so that's as close as the application will get. But this relationship is transient; the same user could use another user agent at a later point.

For the application to treat the user agent as if it were the user it must prove an identity. Typically the identity is a username and a shared secret is used for verification, demonstrating that the meat operating the user agent is who it claims to be.

The system stores any information about the user in the user's account data.

All actions in the system are carried out by an actor. In order to perform an action this actor assumes a role.

The tricky part is that the word "user" can be used to vaguely describe every one of these terms, usually without problems. However, when there is a need for a distinction this causes confusion.

So when does "user" break down? Whenever you need to have multiple instances of these entities associated with a single human.

Roles

Roles solve the problem of needing separate access levels for different users.

A naive solution for multiple access levels involves an access level number (or even just a super user flag) that assigns additional privileges in the system, but with such a scheme it's very hard to change the access controls or to add finer grained permissions. If you assign each permission to each user separately this data quickly becomes unmaintainable as well.

Role based access control is a well understood solution to this problem. Permissions are assigned to roles, and roles, in turn, are assigned to users. I won't explain RBAC, it's far too broad a topic (Google is your friend). The point I'm trying to make is that this useful abstraction relies on decoupling the notion of who the user is from what the user is allowed to do.

Taking a page from capability based systems, instead of using roles as simple data you could even use them as the actual entry point to privileged operations. If the user doesn't have access to the role then sensitive operations are not blocked; they are simply inaccessible.

This is a compelling case for making roles a first class entity in the system. Most applications won't need such flexible authorization, but for the applications that do this is a powerful abstraction.
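As a rough sketch of what a first class role can look like (the class and method names here are made up for illustration), the privileged operation lives on the role object itself:

package Role::Moderator;
use Moose;

has model => ( is => "ro", required => 1 );

# the privileged operation is a method of the role object; code that
# was never handed a Role::Moderator simply has nothing to call
sub delete_comment {
    my ( $self, $comment ) = @_;
    $self->model->delete($comment);
}

Instead of checking a flag and then calling a method on some shared model object, the calling code asks the user for the role and invokes the operation on whatever it gets back (or fails to).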

Suppose you have 3 roles: Guest, Unprivileged, and Administrator. These roles can be associated with very fine grained permissions. User agents representing unidentified users only have access to the Guest role, but logged in users can perform operations enabled by any other roles they've been granted.

If Administrator needs to be split up into Moderator and Superuser, most of the code remains unchanged, and the operation that updates the existing user data is simple too, so making this change go live is relatively low risk.

Identities

Supporting multiple identities is another fairly well understood problem, at least recently. Many applications are starting to support OpenID authentication, and usually existing users expect to be able to associate their existing account with a new ID.

Another example is when Yahoo bought Flickr. Flickr suddenly had two keyspaces for identities, Yahoo IDs and Flickr usernames, both of which mapped to a single account keyspace.

If you model identities as standalone data that references a user (as opposed to being a primary or surrogate key of the user) then you can easily integrate orthogonal authentication mechanisms like RPX, SSL certificates, or just simple usernames/passwords, without polluting any code except the authentication layer. The rest of the code can stay concerned with only the user account the IDs map to.
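In code this simply means each identity is a small object of its own pointing at a user, rather than a column on the user. A sketch (the class and attribute names are illustrative):

package Identity::OpenID;
use Moose;

# the verifiable identifier, e.g. an OpenID URL
has openid_url => ( is => "ro", isa => "Str", required => 1 );

# the user this identity maps to; several identities can point at
# the same user
has user => ( is => "ro", isa => "User", required => 1 );

package Identity::Password;
use Moose;

has username      => ( is => "ro", isa => "Str", required => 1 );
has password_hash => ( is => "ro", isa => "Str", required => 1 );

has user => ( is => "ro", isa => "User", required => 1 );

The authentication layer looks an identity up by its natural key (username, OpenID URL, certificate fingerprint) and hands the rest of the application the user it references.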

Actors

In this case I'm referring to actors in the UML sense, not the actor model sense.

I don't have a positive example for multiple actor support, but there are plenty of negative ones.

First and foremost is Google Apps. I've often wished I could access my work email/calendar from my personal account. Since my personal account will likely outlive my work account I don't want to use it for work or vice versa, but I still need to access the data from both. Fortunately for me email forwarding takes care of pretty much everything, but it's not a real solution.

Instead of being able to quickly switch contexts, I need to copy all the data into one place. Google knows I have access to my work email through forwarding so it lets me pretend I am sending email from there when I'm using my personal account, but there's still no segmentation.

Linode is another example. In order to manage work vs. non-work machines I need to log out and log back in using a different account.

Hypothetically, the two actors that would act on my behalf in these examples would be "Yuval the employee of Infinity Interactive" and "Yuval the unique snowflake". These are both extensions of my person, but they serve different purposes.

While an identity identifies the user from the outside, an actor identifies a user inside the system.

Accounts

In a simple system there usually isn't a strong case for multiple accounts, except when consolidating two accounts (a rare occurrence). Multiple accounts are usually an artifact of not supporting multiple actors.

However, if privacy or segmentation is needed (suppose I would like to keep my personal data hidden from my colleagues) it becomes much more relevant. Unlike the other abstractions, making this distinction is the responsibility of the user, not the application, so this is a niche feature.

Websites like MyOpenID or Plaxo support contextual profiles ("personas" in MyOpenID), allowing you to use separate data in separate contexts. This masquerading ability is an important feature in the services they provide due to the sensitivity of the data, but would be useful in many other sites, especially social networks.

Multiple accounts also facilitate shared accounts. A good example is Twitter accounts for projects. These accounts don't represent a single human (usually the humans operating them have separate accounts), and the limitations are pretty painful:

  • The password is shared by all the collaborators
  • Another user agent must be used (or the user's personal session must be terminated) to operate the shared account

Lightweight context switching would solve this. A single identity could be used to authenticate the user to the application, and access all the accounts the user should have access to.

Accounts are closely related to actors. While accounts encapsulate the data belonging to the user, actors define how a user interacts with other data.

Tying it all together

So here is my "mega ultimate" model for user data. This is contained in a single transient (session bound) user agent object, with multiple delegates.

These delegates are all broken out of the single user object. What remains of this object is a thin shell that unifies all of the delegates for a single user. This object represents the actual human, and all of the different objects they have access to.

The user object is static; it serves as the hub through which all other objects are associated. The user agent object is dynamic; it defines what is known about a user for the duration of a session.

  • The user agent object lives inside the session data.
  • Every "login" action creates an authentication token.
  • This authentication token marks an identity as verified in the user agent.
  • An identity is associated with the central user object.
  • A user object has access to multiple accounts.
  • Each account has any number of actors through which the user interacts with data outside of the account.
  • Each actor has any number of roles through which it performs operations on that data.

The object representing the user agent needs to pick the appropriate delegate for each operation, and in most cases there is one obvious delegate to use.
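A skeletal user agent object along these lines might look like this (just a sketch; the attributes mirror the list above, not any particular framework):

package UserAgent;
use Moose;

# set by the "login" action; guests don't have one
has auth_token => ( is => "ro", predicate => "is_authenticated" );

# the verified identity and the central user object it maps to
has identity => ( is => "ro", isa => "Identity" );
has user     => ( is => "ro", isa => "User" );

# the currently selected account and actor out of the ones the
# user has access to
has active_account => ( is => "rw", isa => "Account" );
has active_actor   => ( is => "rw", isa => "Actor" );

sub role {
    my ( $self, $name ) = @_;

    # unauthenticated agents only ever get the guest role
    unless ( $self->is_authenticated ) {
        return $name eq "guest" ? Role::Guest->new : undef;
    }

    return $self->active_actor->role($name);
}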

Here is a diagram for the objects modeling an authenticated user:

[diagram representing an authenticated user agent]

The user agent object is responsible for picking the "active" instance of each of the multiple possible values.

In this case the user has a single user account. This account has access to two different data contexts through two actors, Work and Personal. There are only two roles in this system: Unprivileged, which doesn't require authentication, and Known User, which is just a default role for authenticated users. The user has registered both a username and an OpenID.

The data in the dashed box has a lifetime of a single session, and represents the current set of behaviors the user is assuming, while the data on the right is persistent and represents anything the user can be.

The user signed in using the username, and is currently operating in the Work context. The Work actor's assigned roles are currently active.

Conversely, an unauthenticated user agent only has the guest role active, and makes no user data available:

[diagram representing a guest user agent]

A sign in action upgrades the guest user agent to an authenticated one by adding the auth token, which enables access to the full user data.

All activity that is made possible by the Unprivileged role is available to both unauthenticated and authenticated users, but actions that require a signed in user must be performed through the Known User role.

Here are some of the possibilities of this overengineering marvel:

  • Logging into the same account with different credentials.
  • Allowing an administrator to act as another user.
  • Accessing different data contexts (e.g. work vs. personal).
  • Shared accounts with proper segmentation of private data.
  • Fine grained privacy controls.
  • Fine grained, extensible permissions model.
  • Account consolidation process is obvious.
  • Stable internal identifiers.

These features aren't easy to implement with a single canonical user object whose ID is its primary key, and where all the data is stored as simple attributes of this object.

YAGNI

Forget the "ultimate" model. It's overkill.

What's important is merely to understand the distinction between the different concepts. Most of the time you don't actually need to implement this distinction; a single user object will suffice.

Just make sure you know when and where to break things down into smaller bits, and don't do things that will prevent you from refactoring this in the future. When you need to implement one of those tricky features, avoid ad-hoc workarounds. In the long run, without having clear separation of concepts you're likely to end up with denormalized user data and inconsistent behavior.

Using duck typing (or roles or interfaces) you can simply treat your user object as an account/role/actor/identity object in the relevant parts of your code. As long as you keep the separation of the concepts clear in the usage, making the switch towards a separate first class object in the actual representation is a simple matter of refactoring.

The tradeoff of creating more abstraction layers provides, as always, flexibility at the cost of complexity. Often we resort to inferior workarounds because they seem simpler, when in truth they are just dumbing down the problem. KISS is not a synonym for "half assed".

Catalyst

Apart from a few nits these concepts map pretty well to Catalyst applications, using the existing authentication framework.

Using multiple distinct identities might require a bit of magic, as most user stores assume a user is an identity, instead of a user having multiple identity delegates.

By default you can still have multiple identities per user, but the user-to-identity relationship is "is a" instead of "has a". The object that ends up in $c->user must also be the object negotiating the authentication sequence.

One way around this is to treat each Catalyst-level user object as an identity, and have the "real" user object become a delegate of it. The problem here is one of naming: $c->user->user can get pretty confusing when it really means $c->authenticated_identity->user.
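If you go that route, Moose's handles can hide most of the extra hop by delegating the commonly used methods from the identity down to the real user object (a sketch; the class and method names are hypothetical):

package MyApp::Identity;
use Moose;

# the "real" user object becomes a delegate of the identity that
# comes back from authentication
has user => (
    is      => "ro",
    isa     => "MyApp::User",
    # expose the common account methods directly on the identity,
    # so callers rarely need to spell out ->user->user
    handles => [qw(display_name accounts roles)],
);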

Alternatively, the realm object can traverse the identity objects, and work with the "central" user object directly. In this approach $c->user is the user agent object.

A fully fledged user agent object would require custom work, as it must be tailored to your application's model of user data.

Secondly, the RBAC plugin provides a controller level predicate check for roles. The roles I've been describing in this post are more involved. Each user has role instances that are actual objects in the model. The model is manipulated by calling methods on these objects. The low level operations are still implemented by the corresponding model objects, but the controller actions invoked by the user don't access them directly, they are proxied by the roles.

Obviously the RBAC plugin can still check for the existence of these roles in order to generate simpler access violation errors earlier in the control flow, but it's no longer required to enforce the roles; enforcement is done using a safer, capabilities inspired approach.

Finally, actors and accounts are purely application specific abstractions. If required, they are your responsibility to implement in the model, not the responsibility of a plugin or framework.

The End

I'd like to thank Chris Prather for his input and for inspiring this article in the first place by patiently listening to my ranting as he was trying to rewrite the authentication subsystem of his IRC bots.

Saturday, June 13, 2009

Git, S3 and RewriteMap

I've written a tool to upload files to S3 using content addressable semantics, where the S3 key is a hash of the data. This is a very natural fit for exporting a Git repository, so obviously that is well supported ;-)

My problem

I host my website from a Linux box on a puny cable modem connection; the box has been in my parents' living room since I was in high school. The bandwidth sucks, but I like the convenience of having shell access. If I were setting it up now I would probably just get a linode, but this already works, so until the hardware dies I'm not going to fix what isn't broken.

The website is just static files stored in Git. I didn't want to encumber the HTML with hashed URIs that I would need to keep updated whenever I changed the images. I also wanted the HTML to be viewable locally as I make changes.

The only files I wanted offloaded to S3 were the largish images. The key is that the offloading to S3 had to be non-intrusive, or it wouldn't be worth the effort.

Git to S3

The first step was getting the content onto S3 from Git. For this I've written two Perl modules. Net::Amazon::S3::CAS does the heavy lifting of synchronizing a set of blobs to S3, while Net::Amazon::S3::CAS::Git provides the Git integration. Net::Amazon::S3::CAS also supports uploading blobs from other sources, such as the file system; it's not Git specific.

Net::Amazon::S3::CAS::Git provides a git-to-s3 command which makes exporting from Git to S3 easy. It uses git-ls-tree internally. Here's an example setup:

git to-s3 --aws-secret-access-key foo \
          --aws-access-key-id bar \
          --bucket blobs.woobling.org \
          --prefix nothingmuch.woobling.org/ \
          --prune --vhost \
          $( git ls-tree -r --name-only HEAD | grep -iE '\.(png|jpg)$' )

In this example the ls-tree pipeline generates the list of files that appear to be images, which git-to-s3 then passes as patterns to its own ls-tree invocation. In the future I'll probably add MIME type and file size based filtering.

Net::Amazon::S3::CAS then uploads each image under an S3 key based on its blob hash. For example this image:

% git ls-tree HEAD bg.jpg
100644 blob ae5684b40b111e70c2dd4d69f498ddcbf4ff78dd bg.jpg

is uploaded to S3 as http://blobs.woobling.org/nothingmuch.woobling.org/ae5684b40b111e70c2dd4d69f498ddcbf4ff78dd.

Since the URIs contain a hash of the contents, the Expires and Cache-Control headers are set pretty aggressively (10 years by default, with public cache permissions). Similarly, blobs are skipped during uploading if they already exist in the bucket.

RewriteMap

The git-to-s3 script prints a map to STDOUT which Apache can use via the RewriteMap directive. Save the output in a file, and then add this to your Apache config:

RewriteEngine On
RewriteMap s3 txt:/path/to/rewritemap.txt
RewriteCond ${s3:$1} !=""
RewriteRule ^/(.*)$ ${s3:$1} [R]

The RewriteCond ensures the RewriteRule only applies to URIs that have entries in the map.
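For reference, the map file itself is just whitespace separated key/value pairs, one per line; given the bg.jpg example from earlier, it contains a line something like:

bg.jpg http://blobs.woobling.org/nothingmuch.woobling.org/ae5684b40b111e70c2dd4d69f498ddcbf4ff78dd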

With this in place http://nothingmuch.woobling.org/bg.jpg now properly redirects to S3.

Hopefully S3 will support creating keys that give back 302 status codes at some point in the future, making this step unnecessary.

Git Hooks

Using post-receive hooks as described by Abhijit Menon-Sen, you can automate this process. I won't go into the details he covers, since all I did was add the above to-s3 call to my post-receive hook.

This setup gives me the simplicity of editing static content, and all I need to do is run git push to update my live site, with automated uploading of the heavy files to S3.

Thursday, June 11, 2009

BerkeleyDB::Manager

Another module I haven't talked about is BerkeleyDB::Manager, a convenience wrapper for BerkeleyDB.

The interface that BerkeleyDB exposes to Perl is pretty close to the C API, which can get very tedious.

For me the hardest part of using Berkeley DB was getting everything properly set up. For example, in order to get transaction support you must pass a number of flags to initialize the various subsystems correctly, open the database with the right options, create transactions using the environment, assign them to the database handles you want them to apply to, and then commit or rollback while checking errors everywhere. Compared to DBI this is torture!

All the configuration is done by twiddling bits using flag constants, and every call must be checked for errors manually. For instance, the proper way to atomically increment a value ($db{$key}++) is:

# create an environment home directory manually
mkdir $home or die $!;

# instantiate a properly configured environment
my $env = BerkeleyDB::Env->new(
    -Home   => $home,
    -Flags  => DB_INIT_TXN|DB_INIT_LOG|DB_INIT_LOCK|DB_INIT_MPOOL|DB_CREATE,
) || die $BerkeleyDB::Error;

my $txn = $env->txn_begin || die $BerkeleyDB::Error;

# open a database using the environment
my $db = BerkeleyDB::Btree->new(
    -Env      => $env,
    -Filename => $db_name,
    -Flags    => DB_CREATE|DB_AUTO_COMMIT,
) || die $BerkeleyDB::Error;

# activate the transaction for a database handle
$txn->Txn($db);

# get a value
my $value;

if ( ( my $ret = $db->db_get( $key, $value ) ) != 0 ) {
    if ( $ret != DB_NOTFOUND ) {
        die $BerkeleyDB::Error;
    }
}

# update it atomically
if ( $db->db_put( $key, $value + 1 ) != 0 ) {
    die $BerkeleyDB::Error;
}

if ( $txn->txn_commit != 0 ) {
    die $BerkeleyDB::Error;
}

Berkeley DB is designed to be used on anything from embedded devices to replicated clusters storing terabytes of data. The price we pay for this is too many knobs. It supports a very large number of features and most of them are optional and independent of each other: journalling, transactions with varying levels of ACID guarantees, locking concurrency, multiversioning concurrency, threading support, multiprocess support and replication to name the big ones.

However this tradeoff doesn't need to be made every single time, and many of the knobs don't even apply to Perl, either because they aren't exposed in the Perl bindings or because they just don't make sense in that environment.

BerkeleyDB::Manager is an object oriented wrapper for Berkeley DB's environment handles. It's basically a factory for database handles.

Out of the box it is configured with what I've come to expect from RDBMSs:

  • Multiprocess safe (locking is enabled by default)
  • Transactions, with autocommit
  • Automatic recovery on startup
  • Deadlock detection

All of these options are configurable as attributes, so you can tweak them if necessary. There are also a few options which are disabled by default (such as log_auto_remove).
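Tweaking one of them is just a constructor argument; for example (a sketch, assuming the attribute is named after the BDB option as above):

my $manager = BerkeleyDB::Manager->new(
    home            => $path,
    create          => 1,
    # disabled by default; see the note on log archival below
    log_auto_remove => 1,
);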

BerkeleyDB::Manager also provides convenience wrappers, so the raw BerkeleyDB example above would look something like this:

my $manager = BerkeleyDB::Manager->new(
    home   => $path,
    create => 1,
);

my $db = $manager->open_db( $db_name );

$manager->txn_do(sub {
    my $value;

    if ( ( my $ret = $db->db_get( $key, $value ) ) != 0 ) {
        if ( $ret != DB_NOTFOUND ) {
            die $BerkeleyDB::Error;
        }
    }

    if ( $db->db_put( $key, $value + 1 ) != 0 ) {
        die $BerkeleyDB::Error;
    }
});

There is no convenience API for put or get since that would require wrapping everything (cursors and database handles). This might be added as an option later, but even though this part of the BDB API is definitely tedious, at least it's not hard to get right, so it isn't a major concern. Unfortunately the tie interface doesn't do error checking either, so it's scarcely a replacement for doing this in your own code.

It's still important to understand how BDB works, by reading the documentation and the C reference. It's really not that hard once you get used to it, and you don't need to remember what everything does, only that it exists. The only exception is log archival. Most people will be happy with log_auto_remove but making that the default makes catastrophic recovery impossible.

Migrating from Subversion to Git

I've been seeing many Subversion repositories being hastily imported to Git. This is unfortunate because not having a cleanly and correctly imported history can reduce the effectiveness of Git's powerful tools, such as git bisect or git blame. Having an accurate revision control history is very helpful for tracking down regressions. Here's my take on how to do this properly.

Subversion issues

There are a few typical problems in Subversion repositories that I've seen:

  • History tends to be crufty (svn ci -m "oops"). Some people consider cleaning such history a bad habit (since it's not what "actually" happened), but IMHO the reason to preserve history is so that you can figure out the purpose or nature of a change.
  • Merge metadata is missing. Even with merges created using SVK or Subversion 1.5, git-svn doesn't import this information.
  • Tags aren't immutable. People sometimes adjust them to reflect what the release really ended up being, but at that point the tag has effectively become a branch. Again, there's no metadata.

When you make a checkout with git svn the results could often be significantly improved:

  • Git has a very good representation for merges.
  • Git supports clean annotated tags (tags with commit messages).
  • Commit messages could be reformatted to follow Git conventions.

When I converted the Moose Subversion repository I wrote a small collection of scripts.

Preparing a git-svn checkout

If you have made any merges using SVK or Subversion 1.5 (Update: see comments) then you should probably use git-svn from Sam Vilain's svn-merge-attrs branch of Git to save a lot of time when restoring merge information. This version of git-svn will automatically add merge metadata into the imported repository for those commits.

Assuming you have a standard trunk, branches and tags layout, clone the repository like this:

git svn clone --prefix=svn/ --stdlayout --authors-file=authors.txt http://example.com/svn/

For large repositories I like to use svnadmin dump and svnadmin load to create a local copy. You can also just run the conversion on your Subversion server. For local repositories use a file:/// URI.

Cleaning up tags and branches

git svn-abandon-fix-refs

will run through all the imported refs, recreating properly dated annotated tags (but only if they haven't been modified since they were created), and making branches out of everything else. It'll also rename trunk to master.

The resulting layout is more like what a Git repository should look like, so git tag -l and git branch -l work as expected.

Restoring merge information

If some of the merges were made by hand or if you didn't use Sam's git-svn then you'll need to recreate the merge metadata yourself. Fortunately this is easily done using the .git/info/grafts file.

The grafts file is a simple table of overridden lists of parents for specific commits. The first column is the commit whose parents you want to override, and the rest of the line is the list of new parents to use. For a regular commit there is only one parent, the previous commit. Merges are commits with more than one parent.

Suppose we have a Subversion repository where revision 1 creates a project, revision 2 creates a branch, revision 3 modifies the branch, and revision 4 merges it back into trunk. If imported to Git without the metadata, revision 4 will have a single parent, revision 1, but its parents should be 1 and 3.

If the IDs of the imported commits are:

Revision  Git Commit
1         e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e
2         7448d8798a4380162d4b56f9b452e2f6f9e24e7a
3         a3db5c13ff90a36963278c6a39e4ee3c22e2a436
4         9c6b057a2b9d96a4067a749ee3b3b0158d390cf1

The line in the .git/info/grafts file that fixes revision 4 would look like this:

9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e a3db5c13ff90a36963278c6a39e4ee3c22e2a436

If you view the history using GitX or gitk then you should now see revision 4 has become a proper merge.

Rewriting history

Most people can happily skip this step.

If you'd like to change the history you can now run git rebase --interactive and use the edit command and git commit --amend to clean up any commits, or squash to combine commits. This is probably a topic for another post, but it's worth mentioning.

However, make sure you keep other tags and branches synchronized when you rebase. This can be done using the grafts file.

Final cleanups

The last bit of conversion involves running

git svn-abandon-cleanup

to clean up SVK style merge commit messages (where the first line is useless with most Git log viewers), and remove git-svn-id strings.

The actual message filtering is done by the git-svn-abandon-msg-filter script. You can customize this to your liking.

Another important side effect of the git filter-branch --all step in git svn-abandon-cleanup is that the grafts entries are incorporated into the filtered commits, so the extra merge metadata becomes clonable.

Finally, all merged branches are removed (using the safe -d option of git branch).

Publishing

The resulting Git repository should be ready to publish as if you created it locally.

git remote add origin git@example.com:repository.git
git push --all
git push --tags

Nontrivial grafting

You can still cleanly import a repository that does not follow the standard directory layout or has other complications (e.g. the repository was moved without importing). Use git-svn to import each directory of history separately and then use grafts to stitch the parts back together.

You can write a script to create a grafts file using git rev-parse, like David Wheeler did for Bricolage.
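A minimal sketch of such a script (the input format here is invented: each line lists the Subversion revision of a merge commit followed by the revisions of its parents, and I'm using git svn find-rev rather than raw git rev-parse to do the mapping):

#!/usr/bin/perl
use strict;
use warnings;

# must be run inside the git-svn clone
while ( my $line = <STDIN> ) {
    # e.g. "4 1 3": the merge revision followed by its parent revisions
    my @revs = split ' ', $line;

    # translate each Subversion revision into the imported commit ID
    my @commits = map {
        chomp( my $sha = `git svn find-rev r$_` );
        die "no commit found for r$_\n" unless $sha;
        $sha;
    } @revs;

    # one grafts entry per line: the commit, then its new parents
    print "@commits\n";
}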

The grafts file can also be used to hide commits, clean up modified tags, etc.

Tuesday, June 2, 2009

Immutability (Part 3)

Update: Adam Kennedy has written on this topic from a more Perlish point of view.

In order to comfortably work with complex immutable data structures we need appropriate tools. When editing mutable data you may simply locate the point where you want to make a modification, and change it in place. When all data is immutable you obviously can't do that, but you shouldn't have to clone everything manually either. There are tools that let you express purely functional modifications in a manner similar to mutable updates. These tools make the benefits of immutable data a more attractive tradeoff.

Immutably updating a simple record is easy:

my $renamed = $person->clone(
    name => "Hefetz",
);

The difficulties arise when you have a more complex hierarchy. With just one more level of indirection the update is already a pain in the ass:

my $updated = $container->clone(
    the_person => $person->clone(
        name => "Hefetz",
    ),
);

When modifying an immutable structure you need to capture the context in which you are making the modification, and alter it as well.

Fortunately there are quite a few generic solutions to this problem.

Mapping

Mapping a list of items in Perl is very easy, but not everything is a list. Compared to Haskell's builtin Functor type class, Perl's map is pretty lame. In Haskell you can call fmap on any value that is mappable, regardless of structure.

In Perl we may think of Functor as a role and of fmap as a method:

my $updated = $container->fmap(sub {
    my $item = shift;

    if ( $item->isa("Person") and $item->name eq $old_name ) {
        return $item->clone( name => $new_name );
    } else {
        return $item;
    }
});

The above code would recurse through $container, and whenever it finds an appropriate Person object it'll replace it with a clone.

Such a mapping is neither very efficient nor expressive, but it clearly demonstrates that you don't need to explicitly specify the context of the transformation. The pattern of cloning a set of dependent objects is implemented in the fmap method and can be reused for many types of updates.

Defining fmap on Person is simple:

method fmap ($callback) {
    $self->$callback();
}

On the container object it's a little more complicated: it needs to recurse into the child first, and then apply the callback to itself:

method fmap ($callback) {
    my $mapped_person = $self->the_person->fmap($callback);

    my $updated = $self->clone(
        the_person => $mapped_person,
    );

    return $updated->$callback();
}

Specialized Transformations

Though handy for bulk updates, for localized updates fmap is a pretty blunt instrument.

Here is a more refined approach:

method update (ArrayRef $path, $callback, @args) {
    if ( @$path ) {
        # updating a child

        my ( $head, @tail ) = @$path;

        return $self->clone(
            $head => $self->$head->update( \@tail, $callback, @args ),
        );
    } else {
        # updating self
        return $self->$callback(@args);
    }
}

To use this update method we provide the path to the item being updated, and a code reference or method name for the actual update operation:

$container->update([qw(the_person)], sub {
    my $self = shift;

    $self->clone(
        name => $new_name,
    );
});

If this were a mutable data structure the same update method would still be useful; the callback would simply change the name in place.

In fact, a polymorphic set_name helper can be defined for both styles. Here is the immutable version:

method set_name ($new_name) {
    return $self->clone( name => $new_name );
}

And the corresponding mutating one:

method set_name ($new_name) {
    $self->name($new_name);
    return $self;
}

Now you can use an identical update call on both mutable and immutable structures:

my $updated = $container->update( [qw(the_person)], set_name => $new_name );

On a mutable structure $updated will be the same as $container.

The drawback of this approach is that these update operations can't be combined conveniently without manually breaking them down. The path to update must be specified from the root:

my $updated = $root->update( [qw(path_to_container)], sub {
    my $container = shift;

    $container = $container->update( ... );

    $container = $container->update( ... );

    return $container;
});

For complicated updates affecting many nodes this is still a lot of manual work compared to mutating updates.

Zippers

A more powerful abstraction is a zipper. Zippers let you walk a data structure much like you would with a mutable object, while the zipper captures the context for you. All modifications are performed through the zipper, and in the end the zipper is unwound, returning the updated data structure.

In principle this is similar to the update method above, but the path to the updated nodes is managed automatically.

If you are familiar with jQuery then zippers should feel natural. Although the DOM API is very mutation focused, jQuery hides this ugliness by providing a clean declarative API. The jQuery API could easily be adapted to perform immutable transformations in much the same manner (I'll post about my forays some other time), but that would be a rather pointless exercise in the browser given the nature of the underlying DOM.

My fellow lambdacamel osfameron has written a toy zipper implementation in Perl. While this is not a serious attempt at a fully featured zipper module, it demonstrates both how to write one and how to use one. Adapting his code to our example results in something like this:

my $updated = Zip->new( node => $container )
            ->traverse('the_person')
            ->traverse('name')
            ->change_with(sub { $new_name })
            ->unzip;

This will work provided the objects involved follow a calling convention for pure attributes, similar to set_name above. For more details read his code.

A production quality MooseX::Zipper would need to rely on Moose's reflection support to provide more robustness, but nevertheless osfameron's implementation is an interesting demo. Though currently unreleased, MooseX::Data provides abstractions that would make writing a "proper" zipper module easier.

Value Identity

When updating nested structures the cloning operation can be skipped for any object that remains unmodified in the transformation.

We can optimize the definition of fmap for containers:

my $updated = refaddr($mapped_person) == refaddr($self->the_person)
    ? $self
    : $self->clone( the_person => $mapped_person );

This simple change takes advantage of the data sharing capabilities of immutable data structures. The size requirements for an fmap on a balanced tree that modifies a single node will be reduced from O(n) to O(log n).

This raises questions about the identity of complex values. I used refaddr, but the memory location of two objects is a very pessimistic test for equality; it's a Perl idiom that applies to mutable data.

This is made more complex when long term storage is thrown into the mix, since the memory location for two Perl space objects representing the same model object might be different, even if their high level identity is actually the same.

Ideally each immutable object would be able to provide a valid identifier for itself, but this raises issues with namespacing. A valid local identifier is not necessarily valid globally, and the identifier must of course be unique in the face of parallel updates as well.

A simple tool for allocating such identifiers is UUIDs. UUIDs can be generated locally, and will still be valid globally. Unlike sequence numbers they don't need to be synchronized either, and may be safely generated independently of all other writers. Since they are short they can easily be embedded in the data and used for comparison purposes.

There is still room for improvement. If two complex immutable values are recursively equivalent they share the same identity and may be safely consolidated. This is possible since they are permanently equivalent. Arguably the IDs of the data are the data themselves, but this is not practical because identifiers must usually be small. An obvious solution is to use a cryptographic hash of the data instead.
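Both approaches are easy to get at from Perl; here's a sketch using Data::UUID for the former and Digest::SHA plus Storable's canonical mode for the latter (so that recursively equivalent structures produce the same hash):

use Data::UUID;
use Digest::SHA qw(sha1_hex);
use Storable qw(nfreeze);

# a globally unique identifier that can be allocated locally,
# with no synchronization between writers
sub fresh_id {
    return Data::UUID->new->create_str;
}

# a content addressed identifier: recursively equivalent values
# hash to the same ID and can be consolidated
sub content_id {
    my $ref = shift;    # a reference to the immutable structure
    local $Storable::canonical = 1;
    return sha1_hex( nfreeze($ref) );
}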

These two techniques for managing identity of immutable values are very useful in practice, especially when persistence (as in on disk storage) is a factor.

A Language for Immutable Updates

If you're familiar with monads then you might have already guessed that it's pretty trivial to compose zipper operations together monadically. In Haskell the resulting code can be structured like operations on mutable data structures, even though in effect it is completely pure.

This composition can be used to create DSLs for purely functional updates which read exactly like the corresponding mutating code. In this make-believe language in which all mutating operations are actually pure operations using zippers, you can still expose the flexibility of immutable data. For instance one could integrate exception handling to roll back to a previous snapshot by providing the global state as it was at the beginning of the try block as an argument to the catch block, providing rollback ability to arbitrary code.

Obviously most languages lack DSL support or syntactic sugar for monadic composition, and therefore don't make this as easy as it could be in an idealized world. However, even though using purely functional transformations will be a bit more verbose or cumbersome than simply mutating the data in these languages, the bulk of the work is still abstracted away.

Summary

I personally don't mind a little extra manual effort to make use of immutable structures, as these abstractions are sometimes overkill, but when the updates become cumbersome I quickly reach for these tools.

For instance the upcoming version of the Forest module has support for purely functional trees using a transform method which is virtually identical to the update method detailed above.

My last work project involved storing such trees in a mostly append only KiokuDB model, inspiring this refactoring. The IDs of the tree nodes are SHA-1s of their contents, automating consolidation in the database. As a result of the immutability the tree nodes can be cached indefinitely as memory permits (there's no need to reload them between requests). The number of lines of code that are more complex than their mutating counterparts would have been can be counted on one hand, and are all hidden by the high level model API anyway. The decision to go with an immutable structure was initially done for robustness reasons (we wanted to make use of the persistence property of immutable data), and its rather meager cost ended up paying for itself in performance, adaptability, and testability many times over.