Sunday, May 24, 2009

Immutable Data Structures

I doubt anyone could make a case against simplicity in software. But every problem has some inherent complexity which cannot be removed, and everything we do adds to it. I think that the real art of programming is finding a solution that adds the least complexity possible. Immutable data structures are a good way of trading complexity in one place for complexity in another, but unfortunately they are not that popular in the Perl world.

Perl is a very rich language. It comes with a very wide spectrum of styles to choose from. This is very much a part of Perl's culture, too, and obviously the fact that there are many different language primitives means that the number of ways to combine them is much larger. But TIMTOWTDI is a mixed bag: there is usually not more than one good way to do something, and which one it is depends on the context.

People usually contrast this with the Zen of Python, which says there should be only one obvious way to do something. I think a better counterexample is purely functional programming. Compared to Perl, Python is still a rich and complex language; its one true way of doing anything is a matter of style and opinion, not a real necessity inherent in the language's structure.

The gist of purely functional languages (with pure being the key word, not functional) is that data never changes. There are no update operations at all. Instead of modifying data in place, you must make a copy of it, overriding the values you want to change.

Of course, we can do this in Perl too:

use MooseX::Declare;

class Person with MooseX::Clone {
    # the MooseX::Clone role provides the clone method used below;
    # "ro" means there is no writer, so name can never change in place
    has name => (
        isa => "Str",
        is  => "ro",
    );
}

Suppose you have a Person object and you'd like to update the name attribute. The common way to do this would be:

$person->name("Hefetz");

However, since $person is an immutable object (the name attribute is marked ro) you need to do something like this instead, using the clone method provided by MooseX::Clone:

my $renamed = $person->clone(
    name => "Hefetz",
);

Likewise, any object pointing at the person would have to be cloned to have its copy of $person replaced with $renamed, or it will not reflect the change.
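For example, here is a minimal sketch continuing the MooseX::Declare example above, with a hypothetical Company class (not from the code above) that points at a Person:

class Company with MooseX::Clone {
    has owner => (
        isa => "Person",
        is  => "ro",
    );
}

my $company = Company->new( owner => $person );

# Renaming the owner means cloning the company as well, so that
# the new company points at the renamed person:
my $new_company = $company->clone(
    owner => $person->clone( name => "Hefetz" ),
);

# $company and $person are untouched and remain valid.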

This is not a common practice in Perl because it often seems easier to just update values in place, but in a purely functional setting it is the only way to update data.

To a typical Perl programmer this would seem like a much more complicated way of handling the data, but really it's actually a pretty powerful tradeoff. Though the extra copying is an added complexity, operations that take this immutable data as inputs have a much simpler set of rules to follow on how the data may be used.

One such benefit is never having to recompute anything. This is called referential transparency: applying a function to the same arguments will always return the same result, so results can safely be cached. I believe that the real benefit lies not in potential efficiency gains (though that is another plus), but in the fact that you can make assumptions about the data in your code, and these assumptions are pretty resistant to changes in the code as well.
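This is what makes memoization safe, for instance. Here is a small sketch using the core Memoize module (fib is just a stand-in for any expensive pure function):

use Memoize;

# A pure function: its result depends only on its arguments,
# so calling it twice with the same argument is wasted work.
sub fib {
    my $n = shift;
    return $n < 2 ? $n : fib( $n - 1 ) + fib( $n - 2 );
}

# Safe only because fib never mutates anything and reads no
# mutable state; cached results can never go stale.
memoize("fib");

print fib(40), "\n";    # computed once, then served from the cache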

Let's say you realize that a data field must suddenly become updatable. While there is an undeniable amount of effort involved in restructuring everything to support multiple copies of the field's container, once that's done it's done. That code will not come back to haunt you because of this change.

Instead of needing to update dependent values, you recreate them in relation to the updated copy. The old data is still usable (and will always remain valid). This may sound like a lot of work, but it's much easier to maintain one version of an algorithm instead of two (one to create the data and another to update it).
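As a small illustration, suppose we keep an index of people keyed by name (@people here is an assumed list containing $person from earlier). There is only the code that builds the index; there is no second code path for patching it after a rename:

# Hypothetical helper: builds the index from scratch.
sub index_by_name {
    my @people = @_;
    return { map { $_->name => $_ } @people };
}

my $index = index_by_name(@people);

# After the rename we rebuild against the updated list. The old
# $index is still valid for anyone who is holding on to it.
my @updated   = map { $_ == $person ? $renamed : $_ } @people;
my $new_index = index_by_name(@updated);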

This may seem limiting at first, but once you get used to working this way it's an excellent tool for identifying which parts of the data should and shouldn't change (and if they should, when and how), allowing you to model the data with more clarity. In the long run this is a big win for simplicity. The code might be solving a complex problem, but it's simpler to adapt and reuse.

One thing to keep in mind is that just because you use immutability to your benefit doesn't mean you need to use it all the time. Before this change in KiokuDB, if the store operation resulted in an error then some objects would be left registered in the live object set even though they hadn't actually been stored. Looking for and removing these objects would have been very hard, but obviously the right thing to do.

The easy solution was to treat the live object set as immutable for the duration of a single store operation. All the operations are made on a temporary buffer object, and if everything completes successfully the buffer's changes are committed to the live object set at the end. I lacked the foresight to do this from the start because the set of live objects is inherently mutable. The extra layer of indirection seemed like added complexity (every lookup on the live object set now has to check the buffer first), but in the end I think it was a definite improvement.
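The pattern looks roughly like this (a minimal sketch, not KiokuDB's actual code; the class and method names are made up):

use MooseX::Declare;

class LiveObjectBuffer {
    has live_set => (
        isa      => "HashRef",
        is       => "ro",
        required => 1,
    );

    has buffer => (
        isa     => "HashRef",
        is      => "ro",
        default => sub { {} },
    );

    # Registrations only touch the buffer; the live set is treated
    # as immutable for the duration of the store operation.
    method register (Str $id, $object) {
        $self->buffer->{$id} = $object;
    }

    # Lookups have to check the buffer before the live set.
    method lookup (Str $id) {
        return exists $self->buffer->{$id}
            ? $self->buffer->{$id}
            : $self->live_set->{$id};
    }

    # Called only if the whole operation succeeded; on error the
    # buffer is simply thrown away and the live set is untouched.
    method commit () {
        my ( $live, $buf ) = ( $self->live_set, $self->buffer );
        @{$live}{ keys %$buf } = values %$buf;
        %$buf = ();
    }
}

If the store fails, the buffer is discarded and the live object set never sees the half-finished changes.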

An example of a much bigger system leveraging immutability is Git. When writing a version control system you must obviously keep the full history of the data. Most systems have approached this as a problem that needs to be overcome. Git is somewhat unique in that it takes advantage of this property in the model's design. By using a simple snapshot-based model instead of a complicated delta-based one, it's able to provide better performance, safety, security and flexibility.

Casual observers initially criticised Git for having a model so simple it was actually naive. It turns out they were confusing the model with its on-disk representation. Git makes this distinction very well, and the result is that it implements powerful features (for instance idempotent patch application) which are apparently too complicated in other systems. Git itself isn't simple at all; the problem of version control is a complicated one, so any system dealing with it is inherently complex. Git's advantage is that it's built on a very simple and future-proof core, allowing the complex parts to evolve more easily.

Anymoose, this post is getting pretty long, so I think I will put off more technical details for a later date. This should make a nice series of posts. I think it should also help me write my talk at YAPC::NA this year.

In a future post I'll probably try and discuss the implications (positive and negative) of using immutable data structures in real code.

If you want some good reading material on this topic, two of my favourite programming books are Purely Functional Data Structures and Algorithms: A Functional Programming Approach. Both are surprisingly short but cover a lot of ground. A good introduction to this whole approach is the article Worlds: Controlling the Scope of Side Effects.

6 comments:

kroiz said...

Very interesting.

Unknown said...

I really wish Moose was 'ro' by default... would have been so much nicer.

jrockway said...

Moose is neither ro nor rw by default.

zby said...

As a kind of counterargument see http://www.cs.cmu.edu/~NatProg/papers/Stylos2007CreateSetCall.pdf - apparently it is easier (faster to learn and produces less errors) to create the object first (with no initial parameters) and then use accessors to set the attributes.

hdp said...

This article counters Yuval's post in the same way that one could counter "these apples are delicious" by saying "but many people prefer seedless grapes to grapes with seeds in", that is to say, not at all.

(Originally commented here.)

Mark Stosberg said...

Consider adding a link to the next Immutable post at the bottom, so people can more easily read the series in flow.

http://blog.woobling.org/2009/05/immutable-data-structures-cont.html