Friday, July 2, 2010

Why another caching module?

In the last post I namedropped Cache::Ref. I should explain why I wrote yet another Cache:: module.

On the CPAN most caching modules are concerned with caching data in a way that can be used across process boundaries (for example on subsequent invocations of the same program, or to share data between workers).

Persistent caching behaves more like an on disk database (a DBM, or a directory of files), whereas Cache::Ref is like an in memory hash with size limiting:

my %cache;

sub get { $cache{$_[0]} }

sub set {
    my ( $key, $value ) = @_;

    if ( keys %cache > $some_limit ) {
        ... # delete a key from %cache
    }

    $cache{$key} = $value; # not a copy, just a shared reference
}
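Fleshed out, that sketch might look something like this. This is a toy with an arbitrary eviction policy, not Cache::Ref's actual code; `$some_limit` and the sub names are illustrative:

```perl
use strict;
use warnings;

my %cache;
my $some_limit = 3;

sub cache_get { $cache{ $_[0] } }

sub cache_set {
    my ( $key, $value ) = @_;

    # evict an arbitrary key once the limit is reached; a real policy
    # (LRU, CLOCK, etc.) would choose the victim more intelligently
    if ( keys %cache >= $some_limit ) {
        delete $cache{ ( keys %cache )[0] };
    }

    $cache{$key} = $value;
}

cache_set( a => 1 );
cache_set( b => 2 );
cache_set( c => 3 );
cache_set( d => 4 );    # forces an eviction first

print scalar keys %cache, "\n";    # stays at 3
```

The interesting part is entirely in which key gets deleted; that choice is what the different Cache::Ref submodules implement.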

The different submodules in Cache::Ref are fairly faithful implementations of algorithms originally intended for virtual memory management, and are therefore appropriate when the cache is memory resident.

The goal of these algorithms is to quickly choose the most appropriate key to delete, without storing too much information about each key or requiring costly metadata updates during a cache hit.

This also means less control, for example there is no temporal expiry (i.e. cache something for $x seconds).

If most of CPAN is concerned with L5 caching, then Cache::Ref tries to address L4.

High level interfaces like CHI make persistent caching easy and consistent, but seem to add memory only caching as something of an afterthought, with most of the abstractions being appropriate for long term, large scale storage.

Lastly, you can use Cache::Cascade to create a multi level cache hierarchy. This is similar to CHI's l1_cache attribute, but you can have multiple levels and you can mix and match any cache implementation that uses the same basic API.
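The cascade idea can be illustrated with plain hashes standing in for the cache levels. This is just a sketch of the lookup/promotion pattern, not Cache::Cascade's actual implementation, and the sub names are made up:

```perl
use strict;
use warnings;

# Two hashes stand in for an L1 (small, fast) and L2 (larger) cache;
# Cache::Cascade does this over real cache objects sharing one API.
my ( %l1, %l2 );

sub cascade_get {
    my ($key) = @_;
    return $l1{$key} if exists $l1{$key};
    if ( exists $l2{$key} ) {
        $l1{$key} = $l2{$key};    # promote the hit into the faster level
        return $l2{$key};
    }
    return undef;
}

sub cascade_set {
    my ( $key, $value ) = @_;
    $l1{$key} = $l2{$key} = $value;    # write through every level
}

cascade_set( foo => 42 );
delete $l1{foo};                   # simulate an L1 eviction
print cascade_get("foo"), "\n";    # found in L2, promoted back into L1
```

Because every level speaks the same get/set API, any Cache::Ref backend (or anything else with that interface) can sit at any level.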


Jakub Narebski said...

Why not use a heap for implementing the LRU algorithm? Unless 'array' is really an array implementation of a heap?

nothingmuch said...

That would require timestamping or a counter but otherwise would be doable.

The doubly linked list is O(1) so there's no real point in a heap; you just add at the end, take from the beginning, and splice to the end on a hit.

In a heap, you'd still have to splice the element out of the middle and reinsert it, which is not guaranteed to be cheap on a binary heap, and the major reason the array implementation is slow is that splice is an O(N) operation. It's probably faster with pointer based heap implementations, but there's no real point, since no comparison between elements is ever necessary: on insert, the element always goes at the end.

Jakub Narebski said...

Errr... I was thinking about implementing a priority queue using the array implementation of a heap. Adding an element to a heap is O(log N), as is removing the "smallest" element from a heap. Getting the "smallest" element from a heap is O(1).

You can implement a heap in an array by assuming that the children of the element at the i-th position (if they exist) are at indices 2*i and 2*i+1. See "Programming Pearls" by Jon Bentley (where such a structure was used to implement heapsort).

Unless there are some operations besides "add element" and "remove smallest" that are needed for the LRU algorithm's bookkeeping?

nothingmuch said...

Heaps are optimized to add an element anywhere in the ordering, and to remove from the head.

To update LRU we need to add an element only to the tail (therefore ordering is not important, the element being inserted is always "biggest"), to remove elements from the head, and most importantly to remove elements from the middle (every hit moves the key to the end of the list again, since it's more recently used).

Implementing a binary heap over an array is definitely nice for heapsorting because it's a very compact representation, but in this case not helpful because to extract an element from the middle and to reinsert it with a new key is O(N) + O(log N), since all the entries after it in the heap need to be moved up one spot, instead of log N swaps.
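The doubly linked list bookkeeping described in this thread can be sketched in plain Perl. The helper names are hypothetical and this is not Cache::Ref's real implementation; it just shows that all three operations (add at tail, remove from head, splice out of the middle on a hit) are O(1):

```perl
use strict;
use warnings;

# A doubly linked list kept in a hash, keyed by cache key.
my %node;    # key => { prev => $key_or_undef, next => $key_or_undef }
my ( $head, $tail );

sub splice_out {    # O(1) removal from anywhere in the list
    my ($key) = @_;
    my $n = delete $node{$key};
    if ( defined $n->{prev} ) { $node{ $n->{prev} }{next} = $n->{next} }
    else                      { $head = $n->{next} }
    if ( defined $n->{next} ) { $node{ $n->{next} }{prev} = $n->{prev} }
    else                      { $tail = $n->{prev} }
}

sub touch {    # mark $key as most recently used: splice out, re-add at tail
    my ($key) = @_;
    splice_out($key) if exists $node{$key};
    $node{$key} = { prev => $tail, next => undef };
    $node{$tail}{next} = $key if defined $tail;
    $tail = $key;
    $head = $key unless defined $head;
}

sub evict {    # the head is always the least recently used key
    my $victim = $head;
    splice_out($victim) if defined $victim;
    return $victim;
}

touch($_) for qw(a b c);
touch("a");              # hit: "a" is spliced out and moved to the tail
print evict(), "\n";     # "b" is now the least recently used
```

No comparisons between elements ever happen, which is why the heap's ordering machinery buys nothing here.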