Saturday, September 5, 2009

Cache Assertions

Caching an expensive computation is a great way to improve performance, but when it goes wrong the bugs can be very subtle. It's vital to be sure that the cached result is, for all intents and purposes, the same as what the computation would have produced.

A technique I often employ is using a thunk to defer the computation:

$cache_helper->get_value( $key, sub {
   ... produce the expensive result here ...
});

If there is a cache hit then the subroutine is simply not executed:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    if ( my $cache_hit = $self->cache->get($key) ) {
        return $cache_hit;
    } else {
        my $value = $thunk->();

        $self->cache->set( $key => $value );

        return $value;
    }
}

The get_value method must be referentially transparent; when invoked with the same $key it should always produce a consistent result regardless of the state of the cache.

When adding caching to existing code I often accidentally choose a cache key that doesn't contain all the implied parameters, since the details of the code I'm caching are no longer fresh in my memory.
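A made-up illustration of that kind of mistake (all the names here are hypothetical, not from any real code): suppose the result depends on both the user and the item, but the key only encodes the item:

# buggy: the discount depends on $user as well, so the first user's
# price ends up being served to everyone
my $price = $cache_helper->get_value( "price:$item_id", sub {
    compute_discounted_price( $user, $item );
});

# the key has to include every implied parameter, e.g.
# "price:" . $user->id . ":$item_id"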

The reason I like the thunk approach is that $cache_helper can be swapped for something else to easily check for such caching bugs.

For instance, to confirm that a reproducible problem is a cache-related bug, you can disable caching temporarily using this implementation:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    # bypass the cache entirely and just run the computation
    return $thunk->();
}

Or better yet, use an asserting helper in your test environment, where performance isn't important:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    my $value = $thunk->();

    if ( my $cache_hit = $self->cache->get($key) ) {
        $self->check_cache_hit( $key, $cache_hit, $value );

        # it's important to return $cache_hit in case check_cache_hit
        # isn't thorough enough, so that this doesn't mask any bugs
        # that would manifest only in production
        return $cache_hit;
    } else {
        $self->cache->set( $key => $value );

        return $value;
    }
}

use Carp qw(confess);

sub check_cache_hit {
    my ( $self, $key, $cache_hit, $value ) = @_;

    confess "Bad cache hit: $key, $cache_hit != $value"
        unless ...;
}

This approach can be used on any sort of cache, from $hash{$key} ||= $thunk->() to CHI.
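For the CHI end of that spectrum, a minimal sketch might look like this, assuming the helper accepts any object with get/set methods (the Memory driver and the My::CacheHelper class name are only for illustration):

use CHI;

# any CHI driver (File, Memcached, ...) could be plugged in the same way
my $cache_helper = My::CacheHelper->new(
    cache => CHI->new( driver => 'Memory', global => 1 ),
);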

It's useful to have several get_value-like methods to handle the different result types more easily (check_cache_hit can be a little intricate, and adding namespacing to keys is always a good idea).

This is useful for other reasons too, for instance collecting timing information to measure the effectiveness of caching, or using different cache storage for different result types.
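A timing variant might look something like this (record_hit and record_miss are hypothetical bookkeeping methods, not part of anything above):

use Time::HiRes qw(time);

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    if ( my $cache_hit = $self->cache->get($key) ) {
        $self->record_hit($key);
        return $cache_hit;
    }

    # time the expensive computation so hits and misses can be compared
    my $start = time;
    my $value = $thunk->();
    $self->record_miss( $key, time() - $start );

    $self->cache->set( $key => $value );

    return $value;
}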

The key to using this effectively is to make the caching object an easily overridable delegate in your application's code. For instance, if you're using Catalyst you can specify the subclass in your configuration, or even make it conditional on $c->debug.
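Something along these lines, for instance; the class and config key names here are made up, only $c->debug, $c->config and $c->model are actual Catalyst methods:

# pick the helper class from the app config, falling back to the
# asserting variant when running under -Debug
my $helper_class = $c->config->{cache_helper_class}
    || ( $c->debug ? 'MyApp::CacheHelper::Asserting' : 'MyApp::CacheHelper' );

my $cache_helper = $helper_class->new( cache => $c->model('Cache') );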

Lastly, it's worth mentioning that my sample implementation is actually wrong if false values are valid results, since the boolean check on the cached value treats them as cache misses.
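One minimal way to fix that, assuming the backend's get returns undef on a miss, is to store the value wrapped in a scalar reference, so that even a cached false value is a true cache hit:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    if ( my $cache_hit = $self->cache->get($key) ) {
        return ${ $cache_hit };    # unwrap the stored value
    } else {
        my $value = $thunk->();

        # store a reference, which is always true, instead of the raw value
        $self->cache->set( $key => \$value );

        return $value;
    }
}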

2 comments:

DFH said...

Why do you use a thunk instead of delegating to a helper class, say, or hard coding the computation into the caching class (perhaps derived from a more generic class)?

nothingmuch said...

If no caching were in place this would be lexically almost the same code; the caching merely adds the call to the cache helper and the curlies around the cached expression.

Introducing the extra thunk indirection is a minimal code change from what it would have been without caching.

If the caching logic is not a fundamental part of the algorithm then I usually see it as a cross-cutting concern that should impact the code as little as possible.