Friday, May 29, 2009

Immutable Data Structures (cont.)

Blah blah, immutable is great, functionally pure is cool, but isn't it slow with all that copying? Well, duh, if you misuse it. Then again, so is anything else. I've already made the unsubstantiated claim that immutability leads to better code, so in this post I will try to focus on more measurable advantages such as performance.

In 2009 everyone seems to want to scale. Cloud this, cluster that, consistent hashing, and so on. I firmly believe you need to actually have a successful product in order to run into scaling problems. If you are lucky enough to have a real need for scaling then you're probably aware that aggressive caching is a pretty reliable way of improving scalability.

The challenge of caching is making sure the cached data you are using is still valid. If the data has been updated then data which depends on it is now out of date too. This means that updates need to clear (or update) the cache. If there are multiple caches for a single master then these write operations might need to be replicated to all the caches.

Obviously this is no longer a problem if the data you are trying to cache is immutable. The drawback is that the key must change each time the data is updated. One could argue that this is the same problem: we're still fetching data from the database (or a cache that is kept up to date) on each request. The difference lies in how much data we fetch and how costly or hard it is to fetch it or keep it up to date.

Suppose you're storing images. If you name each file based on the hash of its contents you've just created a stable ETag. Since the URL will encapsulate the ETag, it's valid forever. You can set Expires and Cache-Control to a time when robots will rule the earth. Duplicate files will be consolidated automatically, and since there's no need to worry about generating sequence numbers, the data is easy to replicate and change in a distributed setup.
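To make this concrete, here is a minimal sketch of content addressed image storage in Perl, using only core modules (Digest::SHA, File::Spec); the directory layout, the .jpg suffix and the store_image name are made up for illustration.

    use strict;
    use warnings;

    use Digest::SHA qw(sha1_hex);
    use File::Spec;

    # Store an uploaded image under a name derived from its contents.
    # The digest doubles as a stable ETag: the same bytes always map to
    # the same URL, and a new version of the image simply gets a new URL.
    sub store_image {
        my ($upload_path, $store_dir) = @_;

        open my $in, '<:raw', $upload_path or die "open $upload_path: $!";
        my $data = do { local $/; <$in> };
        close $in;

        my $digest = sha1_hex($data);
        my $target = File::Spec->catfile($store_dir, "$digest.jpg");

        unless ( -e $target ) {    # duplicates collapse for free
            open my $out, '>:raw', $target or die "open $target: $!";
            print {$out} $data;
            close $out or die "close $target: $!";
        }

        return $digest;    # embed in the URL, e.g. /img/<digest>.jpg
    }

Serving the resulting URLs with far-future Expires and Cache-Control headers is then safe, since a given digest can never refer to different bytes.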

This can be much finer grained, too. For instance, to cache parts of a complicated page you can use hashes or sequence numbers from the database as cache keys. Assuming the data is immutable you can do simple app level caching, or use these keys to generate ESI identifiers. The caching implementation can be adapted very easily to meet your requirements, without requiring massive changes to the way you store your data. And since everything has a stable identifier, there are no issues with parts of a single page going out of sync if you later switch to ESI for scaling reasons.
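As a sketch of the app level variant, this is roughly what fragment caching keyed by a digest of the underlying data could look like with CHI; the render_bio call and the key scheme are hypothetical stand-ins.

    use strict;
    use warnings;

    use CHI;
    use Digest::SHA qw(sha1_hex);

    my $cache = CHI->new( driver => 'Memory', global => 1 );

    # Render one fragment of a page, keyed by a digest of the data it
    # depends on.  Because the key changes whenever the data does, cached
    # entries never go stale and can be kept indefinitely.
    sub bio_fragment {
        my ($user) = @_;

        my $key = 'bio:' . sha1_hex( $user->{bio} );

        my $html = $cache->get($key);
        unless ( defined $html ) {
            $html = render_bio($user);    # hypothetical, expensive template call
            $cache->set( $key, $html, 'never' );
        }

        return $html;
    }

The same keys can be emitted as ESI fragment identifiers instead; nothing about how the data is stored needs to change.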

I previously used Git as an example of a project that gains efficiency by using immutability. In Git the only things modified in place are references (branches). Since file and revision data is addressed by its contents (using its SHA1 hash), this 40 character identifier (well, 20 bytes actually) can describe the full state of a branch no matter how large the data in the branch is. When you run git remote update the references are updated, and new revision data is copied only if necessary. You're still fetching data each time, but the update is very quick if only a few bytes per branch need to be downloaded, which is the case when nothing has changed. In contrast, rsync, which synchronizes mutable data, needs to work a lot harder to compare state.

The guiding principle here is to shift the update operations upwards, towards the root of the data instead of near the leaves. The same techniques apply to web applications as well.
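A toy example of what shifting updates towards the root looks like for a plain nested structure: only the nodes on the path from the root to the changed leaf are copied, everything else is shared between versions. The set_in helper below is illustrative, not from any particular module.

    use strict;
    use warnings;

    # "Update" a value deep inside a nested hash without mutating anything:
    # only the nodes along the path are copied, untouched siblings are
    # shared with the old version.
    sub set_in {
        my ($node, $path, $value) = @_;

        return $value unless @$path;

        my ($key, @rest) = @$path;
        return {
            %$node,    # share every other key as-is
            $key => set_in( $node->{$key} || {}, \@rest, $value ),
        };
    }

    my $v1 = {
        users => {
            alice => { name => 'Alice', karma => 10 },
            bob   => { name => 'Bob',   karma => 3 },
        },
    };

    my $v2 = set_in( $v1, [ qw(users alice karma) ], 11 );

    # $v1 is untouched, and $v2->{users}{bob} is the very same hashref;
    # only the root, 'users' and 'alice' nodes were copied.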

Propagating changes upwards also reduces transactional load on the database. If the data is keyed by its contents or by a UUID you don't need to generate sequence numbers synchronously, or worry about update contention. You can easily replicate and shard the bulk of the data without transactional semantics (all you need is Durability, the D in ACID), while keeping the fully transactional updates comparatively cheap due to their smaller size. Consistency is still guaranteed, because once you can refer to a piece of immutable data it's always consistent. If the transactional parts need to be distributed, eventual consistency is easier to achieve when the bulk of the data is only ever added, while destructive updates are kept small and simple.
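A rough sketch of that split, using DBI with SQLite; the blobs and documents tables are invented for the example. Writing the bulk data is idempotent and needs no coordination, while the transactional part touches a single small row.

    use strict;
    use warnings;

    use DBI;
    use Digest::SHA qw(sha1_hex);

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=app.db', '', '',
        { RaiseError => 1, AutoCommit => 1 } );

    # Bulk data is keyed by its own digest: no sequence generator, no
    # contention, and the write can be retried or sent to any replica.
    sub store_blob {
        my ($data) = @_;
        my $key = sha1_hex($data);
        $dbh->do(
            'INSERT OR IGNORE INTO blobs (id, body) VALUES (?, ?)',
            undef, $key, $data,
        );
        return $key;
    }

    # Only the small mutable "root" pointer is updated transactionally.
    sub update_document {
        my ($doc_id, $new_body) = @_;
        my $key = store_blob($new_body);

        $dbh->begin_work;
        $dbh->do(
            'UPDATE documents SET current_blob = ? WHERE id = ?',
            undef, $key, $doc_id,
        );
        $dbh->commit;

        return $key;
    }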

Though this is not really relevant in Perl, there are also benefits for multithreaded code. In a low level language like C, where you can royally screw things up by accessing shared data without locking, immutable structures provide both speed (no need to lock) and safety (no deadlocks, no inconsistent state) when shared. Ephemeral inconsistency is very hard to reproduce, let alone fix. You obviously still need to take locks on mutable variables pointing to immutable structures, but the data inside the structure can be used lock free. Immutability is also a key part of why STM is so successful in Haskell: if most operations are pure and only a few TVars are subject to contention, optimistic concurrency is usually a win, and the overhead of such a high level abstraction ends up being pretty low in practice.

Immutable models are also more resilient to bugs and data corruption. By distilling the model into its most succinct, normalized representation you can express things more clearly, while still easily supporting safe and clean denormalization, without needing to synchronize updates. The "core" data is authoritative, and the rest can be regenerated if needed (even lazily). Assuming you've written tests for your model you can be that much more sure of it, too. There is no possibility of action at a distance or sensitivity to the ordering of actions. If the data is correct the first time it will stay correct, so bugs and oversights tend to get ironed out early in the development of the model.

If you did make a mistake and allowed an invalid structure to be created, it's usually also easier to recover the data. This is especially true if you make use of persistence (keeping older versions around), because you'll have more detail than just a summary of the cause and effect chain.

However, there is also a major concern that many people don't anticipate: cleanup. Especially when privacy is an issue, immutability can make things harder. The more you take advantage of data sharing the harder garbage collection becomes, especially in a content addressable keyspace. If you replicate and cache your data, deleting it is still easier than updating it correctly, but it's not as easy as just writing it out and forgetting about it.
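In practice cleanup in a content addressed store ends up looking like a small garbage collector: mark what is still reachable from the live roots, then sweep the rest. This sketch reuses the file layout from the image example above; how you enumerate the live digests depends entirely on your data model.

    use strict;
    use warnings;

    # Delete every file in the store whose digest is no longer referenced
    # by any live root.  Computing @live_digests (the mark phase) is the
    # application specific part.
    sub sweep_store {
        my ($store_dir, @live_digests) = @_;

        my %reachable = map { $_ => 1 } @live_digests;

        opendir my $dh, $store_dir or die "opendir $store_dir: $!";
        for my $file ( readdir $dh ) {
            my ($digest) = $file =~ /^([0-9a-f]{40})/ or next;
            next if $reachable{$digest};
            unlink "$store_dir/$file" or warn "unlink $file: $!";
        }
        closedir $dh;
    }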

Finally, I'd like to emphasize the distinction between a purely functional data structure and a merely immutable one. If an immutable object contains references to mutable data, directly or indirectly, it isn't pure. Most of these benefits assume purity. Impurities can sometimes take the pain out of making everything immutable by letting you cut corners, but it's tempting to go too far out of short sighted laziness and lose out on everything in the long run.
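A tiny example of the distinction, using Moose read-only attributes on a made-up User class: none of the slots can be reassigned, yet the object is not pure, because the array it references can still be mutated behind its back.

    package User;
    use Moose;

    # Read-only accessors make the slots themselves immutable...
    has name => ( is => 'ro', isa => 'Str' );
    has tags => ( is => 'ro', isa => 'ArrayRef[Str]', default => sub { [] } );

    __PACKAGE__->meta->make_immutable;

    package main;
    use strict;
    use warnings;

    my $user = User->new( name => 'alice', tags => ['admin'] );

    # $user->name('bob') would die, but this is perfectly legal:
    push @{ $user->tags }, 'superuser';

    # ...so the object is immutable, but not purely functional: anything
    # holding a reference to it can see its contents change.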

In the end there's usually an obvious choice between a mutable and an immutable design for a particular problem, but mutability seems to be the default, mostly for premature optimization reasons. Unless you are absolutely sure that immutable data will be really slow and wasteful in the big picture, and you also know how much slower it will be, there's usually no good reason to prefer mutability.

In the next post on this subject I will go over some techniques for easily working with immutable data structures. Updating immutable data by hand can be a pain in the ass, but it rarely has to be done that way.

Update: a low level example from OCaml.

1 comment:

Hercynium said...

I look forward to the next post on this... especially if you'll be showing off some Perl code :-)