Tuesday, June 2, 2009

Immutability (Part 3)

Update: Adam Kennedy has written on this topic from a more Perlish point of view.

In order to comfortably work with complex immutable data structures we need appropriate tools. When editing mutable data you can simply locate the point where you want to make a modification and change it in place. When all data is immutable you obviously can't do that, but you shouldn't have to clone everything manually either. There are tools that let you express purely functional modifications in a manner similar to mutable updates, and they make the benefits of immutable data a more attractive tradeoff.

Immutably updating a simple record is easy:

my $renamed = $person->clone(
    name => "Hefetz",
);
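The clone used here is a shallow copy with attribute overrides. Moose users can get one from MooseX::Clone; a minimal hand-rolled sketch, assuming hash-based objects whose constructor accepts the same attributes, might look like:

```perl
# minimal sketch: shallow-copy the object's attributes, applying overrides
sub clone {
    my ( $self, %overrides ) = @_;
    return ( ref $self )->new( %$self, %overrides );
}
```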

The difficulties arise when you have a more complex hierarchy. With just one more level of indirection the update is already a pain in the ass:

my $updated = $container->clone(
    the_person => $person->clone(
        name => "Hefetz",
    ),
);

When modifying an immutable structure you need to capture the context in which you are making the modification, and alter it as well.

Fortunately there are quite a few generic solutions to this problem.

Mapping

Mapping a list of items in Perl is very easy, but not everything is a list. Compared to Haskell's built-in Functor type class, Perl's map is pretty lame. In Haskell you can call fmap on any value that is mappable, regardless of structure.

In Perl we may think of Functor as a role and of fmap as a method:

my $updated = $container->fmap(sub {
    my $item = shift;

    if ( $item->isa("Person") and $item->name eq $old_name ) {
        return $item->clone( name => $new_name );
    } else {
        return $item;
    }
});

The above code would recurse through $container, and whenever it finds an appropriate Person object it'll replace it with a clone.

Such a mapping is neither very efficient nor expressive, but it clearly demonstrates that you don't need to explicitly specify the context of the transformation. The pattern of cloning a set of dependent objects is implemented in the fmap method and can be reused for many types of updates.

Defining fmap on Person is simple:

method fmap ($callback) {
    $self->$callback();
}

On the container object it's a little more complicated: it needs to recurse into the child first, and then apply the callback to itself:

method fmap ($callback) {
    my $mapped_person = $self->the_person->fmap($callback);

    my $updated = $self->clone(
        the_person => $mapped_person,
    );

    return $updated->$callback();
}
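The same pattern extends to containers with more than one child. For a hypothetical tree node whose children attribute holds an array reference, the sketch looks like this:

```perl
method fmap ($callback) {
    # recurse into every child first
    my @mapped = map { $_->fmap($callback) } @{ $self->children };

    my $updated = $self->clone(
        children => \@mapped,
    );

    # then apply the callback to the updated node itself
    return $updated->$callback();
}
```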

Specialized Transformations

Though handy for bulk updates, for localized updates fmap is a pretty blunt instrument.

Here is a more refined approach:

method update (ArrayRef $path, $callback, @args) {
    if ( @$path ) {
        # updating a child

        my ( $head, @tail ) = @$path;

        return $self->clone(
            $head => $self->$head->update( \@tail, $callback, @args ),
        );
    } else {
        # updating self
        return $self->$callback(@args);
    }
}

To use this update method we provide the path to the item being updated, and a code reference or method name for the actual update operation:

$container->update([qw(the_person)], sub {
    my $self = shift;

    $self->clone(
        name => $new_name,
    );
});

If this were a mutable data structure the same update method would still be useful; the callback would simply change the name in place.

In fact, a polymorphic set_name helper can be defined for both styles. Here is the immutable version:

method set_name ($new_name) {
    return $self->clone( name => $new_name );
}

And the corresponding mutating one:

method set_name ($new_name) {
    $self->name($new_name);
    return $self;
}

Now you can use an identical update call on both mutable and immutable structures:

my $updated = $container->update( [qw(the_person)], set_name => $new_name );

On a mutable structure $updated will be the same as $container.

The drawback of this approach is that these update operations can't be combined conveniently without manually breaking them down. The path to update must be specified from the root:

my $updated = $root->update( [qw(path_to_container)], sub {
    my $container = shift;

    $container = $container->update( ... );

    $container = $container->update( ... );

    return $container;
});

For complicated updates affecting many nodes this is still a lot of manual work compared to mutating updates.

Zippers

A more powerful abstraction is a zipper. Zippers let you walk a data structure much like you would with a mutable object, while the zipper captures the context for you. All modifications are performed through the zipper, and at the end the zipper is unwound, returning the updated data structure.

In principle this is similar to the update method above, but the path to the updated nodes is managed automatically.

If you are familiar with jQuery then zippers should feel natural. Although the DOM API is very mutation-focused, jQuery hides this ugliness by providing a clean declarative API. The jQuery API could easily be adapted to perform immutable transformations in much the same manner (I'll post about my forays some other time), but that would be a rather pointless exercise in the browser given the nature of the underlying DOM.

My fellow lambdacamel osfameron has written a toy zipper implementation in Perl. While this is not a serious attempt at a fully featured zipper module, it demonstrates both how to write and how to use them. Adapting his code to our example results in something like this:

my $updated = Zip->new( node => $container )
            ->traverse('the_person')
            ->traverse('name')
            ->change_with(sub { $new_name })
            ->unzip;

This will work provided the objects involved follow a calling convention for pure attributes, similar to set_name above. For more details read his code.
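To make the mechanics concrete, here is a minimal zipper sketch of my own (not osfameron's actual code), assuming every object along the path provides clone and read-only accessors as above:

```perl
package Zip;

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;    # node, and optionally parent & field
}

# descend into an attribute, remembering where we came from
sub traverse {
    my ( $self, $field ) = @_;
    return Zip->new(
        node   => $self->{node}->$field,
        parent => $self,
        field  => $field,
    );
}

# replace the focused node with the result of the callback
sub change_with {
    my ( $self, $callback ) = @_;
    return Zip->new( %$self, node => $callback->( $self->{node} ) );
}

# rebuild the structure from the inside out
sub unzip {
    my $self = shift;

    return $self->{node} unless $self->{parent};

    my $parent = $self->{parent};

    return Zip->new(
        %$parent,
        node => $parent->{node}->clone( $self->{field} => $self->{node} ),
    )->unzip;
}
```

Each traverse pushes a frame recording where we came from; unzip pops the frames, cloning each parent with its updated child.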

A production quality MooseX::Zipper would need to rely on Moose's reflection support to provide more robustness, but nevertheless osfameron's implementation is an interesting demo. Though currently unreleased, MooseX::Data provides abstractions that would make writing a "proper" zipper module easier.

Value Identity

When updating nested structures the cloning operation can be skipped for any object that remains unmodified in the transformation.

We can optimize the definition of fmap for containers (refaddr comes from Scalar::Util):

my $updated = refaddr($mapped_person) == refaddr($person)
    ? $self
    : $self->clone( the_person => $mapped_person );

This simple change takes advantage of the data sharing capabilities of immutable data structures. The size requirement for an fmap on a balanced tree that modifies a single node is reduced from O(n) to O(log n).

This raises questions about the identity of complex values. I used refaddr, but comparing the memory locations of two objects is a very pessimistic test for equality; it's a Perl idiom that applies to mutable data.

This is made more complex when long term storage is thrown into the mix, since the memory locations of two Perl-space objects representing the same model object might differ even when their high level identity is the same.

Ideally each immutable object would be able to provide a valid identifier for itself, but this raises issues with namespacing. A valid local identifier is not necessarily valid globally, and the identifier must of course be unique in the face of parallel updates as well.

A simple tool for allocating such identifiers is UUIDs. UUIDs can be generated locally and are still valid globally. Unlike sequence numbers they don't need to be synchronized either, and may be safely generated independently of all other writers. Since they are short they can easily be embedded in the data and used for comparison purposes.
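Generating one in Perl is a one-liner with, for instance, the Data::UUID module:

```perl
use Data::UUID;

# each call yields a globally unique identifier, with no coordination needed
my $id = Data::UUID->new->create_str;
```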

There is still room for improvement. If two complex immutable values are recursively equivalent they share the same identity and may be safely consolidated. This is possible since they are permanently equivalent. Arguably the IDs of the data are the data themselves, but this is not practical because identifiers must usually be small. An obvious solution is to use a cryptographic hash of the data instead.
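One way to derive such a hash in Perl is to serialize the value canonically and digest the result. This sketch uses Storable and Digest::SHA ($data stands in for a reference to any immutable value; any stable serialization would do):

```perl
use Digest::SHA qw(sha1_hex);
use Storable qw(nfreeze);

# canonical mode sorts hash keys, so equivalent values freeze identically
$Storable::canonical = 1;

my $id = sha1_hex( nfreeze($data) );
```

Recursively equivalent values now get the same ID, so consolidation falls out for free.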

These two techniques for managing identity of immutable values are very useful in practice, especially when persistence (as in on disk storage) is a factor.

A Language for Immutable Updates

If you're familiar with monads then you might have already guessed that it's pretty trivial to compose zipper operations together monadically. In Haskell the resulting code can be structured like operations on mutable data structures, even though in effect it is completely pure.

This composition can be used to create DSLs for purely functional updates which read exactly like the corresponding mutating code. In such a make-believe language, where all seemingly mutating operations are actually pure operations using zippers, you can still expose the flexibility of immutable data. For instance, exception handling could be integrated to roll back to a previous snapshot: the catch block would receive the global state as it was at the beginning of the try block, providing rollback ability to arbitrary code.
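Even without the sugar, the rollback trick is trivial in Perl precisely because nothing is ever mutated; apply_updates below is a hypothetical pure transformation:

```perl
my $snapshot = $root;    # free: immutable data is shared, not copied

my $updated = eval { apply_updates($root) };

if ($@) {
    # roll back by simply keeping the original structure
    $updated = $snapshot;
}
```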

Obviously most languages lack DSL support or syntactic sugar for monadic composition, and therefore don't make this as easy as it could be in an idealized world. However, even though using purely functional transformations will be a bit more verbose or cumbersome than simply mutating the data in these languages, the bulk of the work is still abstracted away.

Summary

I personally don't mind a little extra manual effort to make use of immutable structures, as these abstractions are sometimes overkill, but when updates become cumbersome I quickly reach for these tools.

For instance the upcoming version of the Forest module has support for purely functional trees using a transform method which is virtually identical to the update method detailed above.

My last work project involved storing such trees in a mostly append-only KiokuDB model, inspiring this refactoring. The IDs of the tree nodes are SHA-1s of their contents, automating consolidation in the database. As a result of the immutability the tree nodes can be cached indefinitely as memory permits (there's no need to reload them between requests). The number of lines of code that are more complex than their mutating counterparts would have been can be counted on one hand, and they are all hidden by the high level model API anyway. The decision to go with an immutable structure was initially made for robustness reasons (we wanted to make use of the persistence property of immutable data), and its rather meager cost ended up paying for itself in performance, adaptability, and testability many times over.

4 comments:

Anonymous said...

> my ( $head, @tail ) = @$path;

This bothers me. Unlike functional languages where it's particularly cheap, in Perl you'll have O(n^2) memory of @$path copied around. Presumably you want to run this stuff on structures where this actually matters?

I don't know if there's a way to get cheap splice views on arrays in Perl, but if there is and you're going to be calling $h->f(\@t) recursively on large lists please use it for greater good, k thx.

nothingmuch said...

There aren't, though you could easily fake that with tying, or pass an index as an argument.

That said, in practice it rarely matters, doing that would be a microoptimization. I do use this copying all the time and have never had any performance issues with such code.

The overhead of allocating an array of even 100 members is negligible; you're allocating an array of about 5 variables on every sub call for the pad, Perl's stack is an array which may be extended, etc. I doubt one could measure a difference for paths of under several hundred elements, and most paths tend to be a lot shorter than that.

Anonymous said...

I was thinking more of structures you're pulling off the store, which could reasonably have 10k entries and where this can start to matter.

nothingmuch said...

A traversal path into a structure of N entries usually scales logarithmically with N.

If you're cloning 10k objects just to take a single traversal step I think that's a totally different ballpark, and yes, it's not really reasonable.


If you're cloning 10k paths to make 10k traversals, then arguably you've loaded 10k objects 10k times, so it still pales in comparison ;-)