Wednesday, July 7, 2010

Are we ready to ditch string errors?

I can't really figure out why I'm not in the habit of using exception objects. I seem to only reach for them when things are getting very complicated, instead of by default.

I can rationalize that they are better, but it just doesn't feel right to do this all the time.

I've been thinking about what possible reasons (perhaps based on misconceptions) are preventing me from using them more, but I'm also curious about others' opinions.

These are the trouble areas I've managed to think of:

  • Perl's built in exceptions are strings, and everybody is already used to them. [1]
  • There is no convention for inspecting error objects. Even ->isa() is messy when the error could be a string or an object.[2]
  • Defining error classes is a significant barrier: you need to stop, create a new file, etc. Conversely, universal error objects don't provide significant advantages over strings because they can't easily capture additional data apart from the message.[3]
  • Context capture/reporting is finicky
    • There's no convention like croak for exception objects.
    • Where exception objects become useful (for discriminating between different errors), there are usually multiple contexts involved: the error construction, the initial die, and every time the error is rethrown is potentially relevant. Perl's builtin mechanism for string mangling is shitty, but at least it's well understood.
    • Exception objects sort of imply the formatting is partly the responsibility of the error catching code (i.e. full stack or not), whereas Carp and die $str leave it to the thrower to decide.
    • Using Carp::shortmess(), Devel::StackTrace->new and other caller futzery to capture full information context is perceived as slow.[4]
  • Error instantiation is slower than string concatenation, especially if a string has to be concatenated for reporting anyway.[5]
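
The ->isa() problem from the second bullet can be sketched with core modules only. My::Error::NotFound and classify_error are made-up names for illustration, not part of any existing module; the point is just that the catching code has to check blessed() before it can ask an error what it is.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical error class, just enough to have something to throw.
{
    package My::Error::NotFound;
    sub new   { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub throw { my $class = shift; die $class->new(@_) }
    sub path  { $_[0]{path} }
}

# Calling ->isa directly on an arbitrary value is fragile (it dies on
# unblessed references, for instance), so check blessed() first.
sub classify_error {
    my ($err) = @_;

    return "not found: " . $err->path
        if blessed($err) && $err->isa('My::Error::NotFound');

    return "unknown object error" if blessed($err);
    return "string error: $err";
}

my @results;
for my $thrower (
    sub { My::Error::NotFound->throw( path => '/tmp/foo' ) },
    sub { die "plain old string\n" },
) {
    eval { $thrower->(); 1 } or push @results, classify_error($@);
}

print $_, ( /\n\z/ ? "" : "\n" ) for @results;
```

Every catch site needs some variation of this dance, which is exactly the kind of friction that makes string errors feel cheaper.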

[1] I think the real problem is that most core errors worth discriminating are usually not thrown at all, but actually written to $! which can be compared as an error code (see also %! which makes this even easier, and autodie which adds an error hierarchy).

The errors that Perl itself throws, on the other hand, are usually not worth catching (typically they are programmer errors, except for a few well known ones like Can't locate Foo.pm in @INC).

Application level errors are a whole different matter though, they might be recoverable, some might need to be silenced while others pass through, etc.
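
As a small illustration of the $! and %! point in [1], the following sketch discriminates a failed open by error code instead of by matching the message. The path is an assumption: it is simply expected not to exist on the machine running the code.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

# A path that is assumed not to exist on this machine.
my $path = "/nonexistent/dir/for/demo/file.txt";

my ( $reason, $is_enoent );
if ( open my $fh, '<', $path ) {
    $reason = "opened";
    close $fh;
}
else {
    # Save the numeric value before anything else can clobber $!
    my $errno = $! + 0;
    $is_enoent = $errno == ENOENT ? 1 : 0;

    # %! has a true value for the key matching the current errno
    $reason = $!{ENOENT} ? "missing" : "other failure: $!";
}

print "$reason\n";
```

No exception is involved at all here, which is why string matching rarely comes up for these errors.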

[2] Exception::Class has some precedent here, its caught method is designed to deal with unknown error values gracefully.

[3] Again, Exception::Class has an elegant solution, adhoc class declarations in the use statement go a long way.

[4] XS based stack capture could easily make this a non issue (just walk the cxstack and save pointers to the COPs of appropriate frames). Trace formatting is another matter.

[5] I wrote a small benchmark to try and put the various runtime costs in perspective.

Solutions

Here are a few ideas to address my concerns.

A die replacement

First, I see merit for an XS based error throwing module that captures a stack trace and the value of $@ using a die replacement. The error info would be recorded in SV magic and would be available via an API.

This could easily be used on any exception object (but not strings, since SV magic is not transitive), without weird globals or something like that.

It could be mixed into any exception system by exporting die, overriding a throw method or even by setting CORE::GLOBAL::die.

A simple API to get caller information from the captured COP could provide all the important information that caller would, allowing existing error formatters to be reused easily.

This would solve any performance concerns by decoupling stack trace capturing from trace formatting, which is much more complicated.

The idea is that die would not merely throw the error, but also tag it with context info, that you could then extract.

Here's a bare bones example of how this might look:

use MyAwesomeDie qw(die last_trace all_traces previous_error); # tentative
use Try::Tiny;

try {
 die [ @some_values ]; # this is not CORE::die
} catch {
 # gets data out of SV magic in $_
 my $trace = last_trace($_);

 # value of $@ just before dying
 my $prev_error = previous_error($_);

 # prints line 5 not line 15
 # $trace probably quacks like Devel::StackTrace
 die "Offending values: @$_" . $trace->as_string;
};

And of course error classes could use it on $self inside higher level methods.
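
To make the idea concrete without XS, here is a pure Perl approximation. It is only a sketch: throw_with_context, last_trace and previous_error are hypothetical names, and a Hash::Util::FieldHash side table stands in for the SV magic described above. It also captures and copies the frames eagerly with caller, so it does not have the performance characteristics of the proposed COP based approach.

```perl
use strict;
use warnings;
use Hash::Util::FieldHash qw(fieldhash);

# Side table keyed by the error reference. Entries are removed when
# the error is garbage collected. This stands in for SV magic.
fieldhash my %context;

# Hypothetical die replacement: record the call frames and the value
# of $@ at throw time, then throw. Like the magic based version this
# can only tag references, not plain strings.
sub throw_with_context {
    my ($error) = @_;
    my $previous = $@;

    my @frames;
    my $depth = 0;
    while ( my @frame = caller( $depth++ ) ) {
        push @frames, {
            package => $frame[0],
            file    => $frame[1],
            line    => $frame[2],
            sub     => $frame[3],
        };
    }

    $context{$error} = { frames => \@frames, previous => $previous }
        if ref $error;

    die $error;
}

sub last_trace {
    my $entry = $context{ $_[0] } or return;
    return $entry->{frames};
}

sub previous_error {
    my $entry = $context{ $_[0] } or return;
    return $entry->{previous};
}

sub risky {
    eval { die "earlier failure\n" };    # leaves a value in $@
    throw_with_context( [ 1, 2, 3 ] );
}

my ( $trace, $previous );
eval { risky(); 1 } or do {
    my $error = $@;
    $trace    = last_trace($error);
    $previous = previous_error($error);
};

print "thrown from $trace->[0]{file} line $trace->[0]{line}\n";
print "previous error: $previous";
```

The first frame records where throw_with_context was called, so the reported line is the one inside risky and not the one in the catching code.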

Throwable::Error sugar

Exception::Class got many things right but a Moose based solution is just much more appropriate for this, since roles are very helpful for creating error taxonomies.

The only significant addition I would make is having some sort of sugar layer to lazily build a message attribute using a simple string formatting DSL.
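
One guess at what such a formatting layer could look like, written without Moose so that it stays self-contained. Everything here is an assumption made for illustration: the My::Error classes, the message_format method and the %{name} placeholder syntax are not taken from Throwable::Error or any other module.

```perl
use strict;
use warnings;

{
    package My::Error;
    use overload '""' => sub { $_[0]->message }, fallback => 1;

    sub new {
        my ( $class, %fields ) = @_;
        return bless { fields => \%fields }, $class;
    }

    # Subclasses override this with their own template.
    sub message_format { 'Unknown error' }

    # The message is built on first use and then cached.
    sub message {
        my ($self) = @_;
        return $self->{message} //= do {
            my $text = $self->message_format;
            $text =~ s!%\{(\w+)\}!defined $self->{fields}{$1} ? $self->{fields}{$1} : "<undef>"!ge;
            $text;
        };
    }
}

{
    package My::Error::FileOpen;
    our @ISA = ('My::Error');
    sub message_format { 'Could not open %{path}: %{reason}' }
}

my $err = My::Error::FileOpen->new(
    path   => '/etc/foo',
    reason => 'permission denied',
);

my $text = "$err";    # stringification triggers the lazy build
print "$text\n";
```

Because the message is only assembled when something stringifies the error, errors that are caught and silently handled never pay for the formatting.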

I previously thought MooseX::Declare would be necessary for something truly powerful, but I think that can be put on hold for a version 2.0.

A library for exception formatting

This hasn't got anything to do with the error message, that's the responsibility of each error class.

This would have to support all of the different styles of error printing we can have with error strings (i.e. die, croak with and without $Carp::Level futzing, confess...), but also allow recursively doing this for the whole error stack (previous values of $@).

Exposed as a role, the base API should complement Throwable::Error quite well.

Obviously the usefulness should extend beyond plain text, because dealing with all that data is a task better suited for an IDE or a web app debug screen.

Therefore, things like code snippet extraction or other goodness might be nice to have in a plugin layer of some sort, but it should be easy to do this for errors of any kind, including strings (which means parsing as much info from Carp traces as possible).

Better facilities for inspecting objects

Check::ISA tried to make it easy to figure out what object you are dealing with.

The problem is that it's ugly: it exports an inv routine instead of a more intuitive isa. It's now possible to go with isa as long as namespace::clean is used to remove it, so that it's not accidentally called as a method.

Its second problem is that it's slow, but it's very easy to make it comparable with the totally wrong UNIVERSAL::isa($obj, "foo") in performance by implementing XS acceleration.
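
To show why calling UNIVERSAL::isa as a function is wrong, here is a sketch with a proxy object that overrides isa, the way mock objects and delegating wrappers sometimes do. Duck, DuckProxy and is_instance are hypothetical names; is_instance is written in the spirit of what Check::ISA provides but is not its API.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

{
    package Duck;
    sub new { return bless {}, shift }
}

# A wrapper that claims to be whatever its target is.
{
    package DuckProxy;
    sub new { return bless { target => Duck->new }, shift }

    sub isa {
        my ( $self, $class ) = @_;
        return 1 if ref $self && $self->{target}->isa($class);
        return $self->SUPER::isa($class);
    }
}

# Only ask objects, and always ask them through a method call so that
# overridden isa methods are respected.
sub is_instance {
    my ( $thing, $class ) = @_;
    return blessed($thing) && $thing->isa($class) ? 1 : 0;
}

my $proxy = DuckProxy->new;

# The function call form bypasses the override in DuckProxy
my $as_function = UNIVERSAL::isa( $proxy, 'Duck' ) ? 1 : 0;

my $as_method  = is_instance( $proxy, 'Duck' );
my $on_string  = is_instance( "Duck", 'Duck' );    # a class name, not an object
my $on_hashref = is_instance( {},     'Duck' );    # unblessed reference

print "function: $as_function, method: $as_method\n";
```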

Conclusion

It seems to me if I had those things I would have no more excuses for not using exception objects by default.

Did I miss anything?

Tuesday, July 6, 2010

KiokuDB's Leak Tracking

Perl uses reference counting to manage memory. This means that circular structures are never freed on their own, which causes leaks.

Cycles are often avoidable in practice, but backreferences can be a huge simplification when modeling relationships between objects.

For this reason Scalar::Util exports the weaken function, which can demote a reference so that its referencing doesn't add to the reference count of the referent.
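
A minimal demonstration of the difference, counting destructor calls. The Node class is made up for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

{
    package Node;
    our $destroyed = 0;

    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub DESTROY { $destroyed++ }
}

# Strong cycle: parent and child keep each other alive.
{
    my $parent = Node->new;
    my $child  = Node->new( parent => $parent );
    $parent->{child} = $child;
}
my $after_strong = $Node::destroyed;

# Same shape, but the backreference is weakened.
{
    my $parent = Node->new;
    my $child  = Node->new( parent => $parent );
    weaken( $child->{parent} );
    $parent->{child} = $child;
}
my $after_weak = $Node::destroyed;

print "destroyed after strong cycle: $after_strong\n";
print "destroyed after weak cycle:   $after_weak\n";
```

The first pair is never destroyed while the program runs; the second pair is freed as soon as the block ends.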

Since cycles are very common in persisted data (because there are many potential entry points in the data), KiokuDB works hard to support them, but it can't weaken cycles for you and prevent them from leaking.

Apart from the waste of memory, there is another major problem.

When objects are leaked, they remain tracked by KiokuDB so you might see stale data in a multi worker style environment (i.e. preforked web servers).

The new leak_tracker attribute takes a code reference which is invoked with the list of leaked objects when the last live object scope dies.

This can be used to report leaks, to break cycles, or whatever.

The other addition, the clear_leaks attribute, allows you to work around the second problem by forcibly unregistering leaked objects.

This completely negates the effect of live object caching and doesn't solve the memory leak, but guarantees you'll see fresh data (without needing to call refresh).

my $dir = KiokuDB->connect(
    $dsn,

    # this coerces into a new object
    live_objects => {
        clear_leaks  => 1,
        leak_tracker => sub {
            my @leaked = @_;

            warn "leaked " . scalar(@leaked) . " objects";

            # try to mop up.
            use Data::Structure::Util qw(circular_off);
            circular_off($_) for @leaked;
        }
    }
);

These options were both refactored out of Catalyst::Model::KiokuDB.

Friday, July 2, 2010

Why another caching module?

In the last post I namedropped Cache::Ref. I should explain why I wrote yet another Cache:: module.

On the CPAN most caching modules are concerned with caching data in a way that can be used across process boundaries (for example on subsequent invocations of the same program, or to share data between workers).

Persistent caching behaves more like on disk databases (like a DBM, or a directory of files), Cache::Ref is like an in memory hash with size limiting:

my %cache;

sub get { $cache{$_[0]} }

sub set {
    my ( $key, $value ) = @_;

    if ( keys %cache > $some_limit ) {
        ... # delete a key from %cache
    }

    $cache{$key} = $value; # not a copy, just a shared reference
}

The different submodules in Cache::Ref are pretty faithful implementations of algorithms originally intended for virtual memory applications, and are therefore appropriate for when the cache is memory resident.

The goal of these algorithms is to try and choose the most appropriate key to delete quickly and without storing too much information about the key, or requiring costly updates on metadata during a cache hit.
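
To give a feel for this family of algorithms, here is a simplified sketch of CLOCK (also known as second chance). Sketch::ClockCache is a hypothetical class written for this example and is not the code in Cache::Ref; it only shows how a hit can be recorded by flipping a single flag.

```perl
use strict;
use warnings;

{
    package Sketch::ClockCache;

    sub new {
        my ( $class, %args ) = @_;
        return bless {
            size  => $args{size},
            slots => [],    # ring of { key, value, used }
            index => {},    # key => slot
            hand  => 0,
        }, $class;
    }

    sub get {
        my ( $self, $key ) = @_;
        my $slot = $self->{index}{$key} or return;
        $slot->{used} = 1;    # the only bookkeeping done on a hit
        return $slot->{value};
    }

    sub set {
        my ( $self, $key, $value ) = @_;

        if ( my $slot = $self->{index}{$key} ) {
            @{$slot}{qw(value used)} = ( $value, 1 );
            return;
        }

        my $new = { key => $key, value => $value, used => 0 };

        if ( @{ $self->{slots} } < $self->{size} ) {
            push @{ $self->{slots} }, $new;
        }
        else {
            # Sweep the ring, clearing "used" flags, until a slot
            # that was not touched since the last sweep is found.
            while ( $self->{slots}[ $self->{hand} ]{used} ) {
                $self->{slots}[ $self->{hand} ]{used} = 0;
                $self->{hand} = ( $self->{hand} + 1 ) % $self->{size};
            }

            my $victim = $self->{slots}[ $self->{hand} ];
            delete $self->{index}{ $victim->{key} };
            $self->{slots}[ $self->{hand} ] = $new;
            $self->{hand} = ( $self->{hand} + 1 ) % $self->{size};
        }

        $self->{index}{$key} = $new;
        return;
    }
}

my $cache = Sketch::ClockCache->new( size => 2 );
$cache->set( a => 1 );
$cache->set( b => 2 );
$cache->get('a');         # marks "a" as recently used
$cache->set( c => 3 );    # evicts "b", the first slot without the flag
```

Compare this with a strict LRU, which has to reorder its bookkeeping on every hit.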

This also means less control, for example there is no temporal expiry (i.e. cache something for $x seconds).

If most of CPAN is concerned with L5 caching, then Cache::Ref tries to address L4.

High level interfaces like CHI make persistent caching easy and consistent, but seem to add memory only caching as a sort of an afterthought, with most of the abstractions being appropriate for long term, large scale storage.

Lastly, you can use Cache::Cascade to create a multi level cache hierarchy. This is similar to CHI's l1_cache attribute, but you can have multiple levels and you can mix and match any cache implementation that uses the same basic API.

Thursday, July 1, 2010

KiokuDB's Immutable Object Cache

KiokuDB 0.46 added integration with Cache::Ref.

To enable it just cargo cult this little snippet:

my $dir = KiokuDB->connect(
    $dsn,
    live_objects => {
        cache => Cache::Ref::CART->new( size => 1024 ),
    },
);

To mark a Moose based object as cacheable, include the KiokuDB::Role::Immutable::Transitive role.

Depending on the cache's mood, some of those cacheable objects may survive even after the last live object scope has been destroyed.

Immutable data has the benefit of being cacheable without needing to worry about updates or stale data, so the data you get from lookup will always be consistent, it just might come back faster in some cases.

Just make sure they don't point at any data that can't be cached (that's treated as a leak), and you should notice significant performance improvements.

Monday, June 28, 2010

KiokuDB for DBIC Users

This is the top loaded tl;dr version of the previous post on KiokuDB+DBIC, optimized for current DBIx::Class users who are also KiokuDB non-believers ;-)

If you feel you know the answer to an <h2>, feel free to skip it.

WTF KiokuDB?

KiokuDB implements persistent object graphs. It works at the same layer as an ORM in that it maps between an in memory representation of objects and a persistent one.

Unlike an ORM, where the focus is to faithfully map between relational schemas and an object oriented representation, KiokuDB's main priority is to allow you to store objects freely with as few restrictions as possible.

KiokuDB provides a different trade-off than ORMs.

By compromising control over the precise storage details you gain the ability to easily store almost any data structure you can create in memory.[1]

Why should I care?

Here's a concrete example.

Suppose you have a web application with several types of browsable model objects (e.g. pictures, user profiles, whatever), all of which users can mark as favourites so they can quickly find them later.

In a relational schema you'd need to query a link table for each possible type, and also take care of setting these up in the schema. When marking an item as a favourite you'd need to check what type it is, and add it to the correct relationship.

Every time you add a new item type you also need to edit the favourite management code to support that new item.

On the other hand, a KiokuDB::Set of items can simply contain a mixed set of items of any type. There's no setup or configuration, and you don't have to predeclare anything. This eliminates a lot of boilerplate.

Simply add a favourite_items KiokuDB column to the user, which contains that set, and use it like this:

# mark an item as a favourite
# $object can be a DBIC row or a KiokuDB object
$user->favourite_items->insert($object);
$user->update;

# get the list of favourites:
my @favs = $user->favourite_items->members;
 
# check if an item is a favourite:
if ( $user->favourite_items->includes($object) ) {
    ...
}

As a bonus, since there's less boilerplate this code can be more generic/reusable.

How do I use it?

First off, at least skim through KiokuDB::Tutorial to familiarize yourself with the basic usage.

In the context of this article you can think of KiokuDB as a DBIC component that adds OODBMS features to your relational schema, as a sort of auxiliary data dumpster.

To start mixing KiokuDB objects into your DBIC schema, create a column that can contain these objects using DBIx::Class::Schema::KiokuDB:

package MyApp::Schema::Result::Foo;
use base qw(DBIx::Class::Core);

__PACKAGE__->load_components(qw(KiokuDB));

__PACKAGE__->kiokudb_column('object');

See the documentation for the rest of the boilerplate, including how to get the $kiokudb handle used in the examples below.

In this column you can now store an object of any class. This is like a delegation based approach to a problem typically solved using something like DBIx::Class::DynamicSubclass.

my $rs = $schema->resultset("Foo");

my $row = $rs->find($primary_key);

$row->object( SomeClass->new( ... ) );

# 'store' is a convenience method, it's like insert_or_update
# but it also stores $row->object in KiokuDB

$row->store;

You can go the other way, too:

my $obj = SomeClass->new(
    some_delegate => $row,
);

my $id = $kiokudb->insert($obj);

And it even works for storing result sets:

use Foo;

my $rs = $schema->resultset("Foo")->search( ... );

my $obj = Foo->new(
    some_resultset => $rs,
);

my $id = $kiokudb->insert($obj);

So you can freely model ad-hoc relationships to your liking.

Mixing and matching KiokuDB and DBIC still lets you obsess over the storage details like you're used to with DBIC.

However, the key idea here is that you don't need to do that all the time.

For example, you can rapidly prototype a schema change before writing the full relational model for it in a final version.

Or maybe you need to preserve an intricate in memory data structure (like cycles, tied structures, or closures).

Or perhaps for some parts of the schema you simply don't need to search/sort/aggregate. You will probably discover parts of your schema are inherently a good fit for graph based storage.

KiokuDB complements DBIC well in all of those areas.

How is KiokuDB different?

There are two main things that traditional ORMs don't do easily, but that KiokuDB does.

First, collections of objects in KiokuDB can be heterogeneous.

At the representation level the lowest common denominator for any two arbitrary objects might be nothing at all. This makes it hard to store objects of different types in the same relational table.

In object oriented design it's the interface that matters, not the representation. Conversely, in a relational database only the representation (the columns) matters, database rows have no interfaces.
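
In plain Perl terms, the in memory side of this looks like the sketch below. Picture and UserProfile are made-up classes with no fields in common, which is what makes them awkward to put in one table, yet they fit in one list because they share a title method.

```perl
use strict;
use warnings;

{
    package Picture;
    sub new   { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub title { "Picture: " . $_[0]{filename} }
}

{
    package UserProfile;
    sub new   { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub title { "Profile of " . $_[0]{nickname} }
}

# One collection, two unrelated representations.
my @favourites = (
    Picture->new( filename => 'sunset.jpg', width => 800, height => 600 ),
    UserProfile->new( nickname => 'alice' ),
);

# The calling code only relies on the shared interface.
my @titles = map { $_->title } @favourites;

print "$_\n" for @titles;
```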

Second, in a graph based object database the key of an object in the database should only be associated with a single object in memory, but in an ORM this feature isn't necessarily desirable:

  • It doesn't interact well with bulk fetches (for instance suppose a SELECT query fetches a collection of objects, some of which are already in memory. Should the fetched data be ignored? Should the primary keys of the already live objects be filtered out of the query?)
  • It requires additional APIs to control this tracking behavior (KiokuDB's new_scope stuff)

In the interests of flexibility and simplicity, DBIx::Class simply stays out of the way as far as managing inflated objects goes (the one exception being prefetched and cached resultsets). Whenever a query is issued you're getting fresh data every time.

KiokuDB does track references and provides a stable mapping between reference addresses and primary keys for the subset of objects that it manages.
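
The general technique can be sketched as a two way registry. This is not KiokuDB's implementation: Sketch::LiveObjects and its method names are hypothetical, and the sketch leaves out scopes entirely. It only shows how weak references and an identity keyed hash give a stable mapping in both directions without keeping objects alive.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

{
    package Sketch::LiveObjects;
    use Scalar::Util qw(weaken);
    use Hash::Util::FieldHash qw(fieldhash);

    sub new {
        my ($class) = @_;

        # object => id; entries are removed when the object is freed
        fieldhash my %ids;

        return bless { objects => {}, ids => \%ids }, $class;
    }

    sub register {
        my ( $self, $id, $object ) = @_;
        $self->{objects}{$id} = $object;
        weaken( $self->{objects}{$id} );    # tracking must not keep it alive
        $self->{ids}{$object} = $id;
        return;
    }

    sub id_to_object { $_[0]{objects}{ $_[1] } }
    sub object_to_id { $_[0]{ids}{ $_[1] } }
}

my $live = Sketch::LiveObjects->new;
my ( $same, $id );

{
    my $user = { name => 'alice' };
    $live->register( 'user:1', $user );

    # Looking up the key gives back the very same reference
    $same = refaddr( $live->id_to_object('user:1') ) == refaddr($user) ? 1 : 0;
    $id   = $live->object_to_id($user);
}

# Once the last real reference is gone, so is the tracked object
my $gone = defined $live->id_to_object('user:1') ? 0 : 1;
```

An object that is part of an unweakened cycle would never reach the "gone" state here, which is the stale data problem that leak tracking is meant to expose.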

What sucks about KiokuDB?

It's harder to search, sort and aggregate KiokuDB objects. But you already know a good ORM that can do those bits ;-)

By letting the storage layer in on your object representation you allow the database to help you in ways that it can't if the data is opaque.

Of course, this is precisely where it makes sense to just create a relational table, because DBIx::Class does those things very well.

Why now?

Previously you could use KiokuDB and DBIx::Class in the same application, but the data was kept separate.

Starting with KiokuDB::Backend::DBI version 1.11 you can store part of your model as relational data using DBIx::Class and the rest in KiokuDB.

[1] You still get full control over serialization if you want, using KiokuDB::TypeMap, but that is completely optional, and most of the time there's no point in doing that anyway, you already know how to do that with other tools.

Sunday, June 27, 2010

KiokuDB 0.46

rafl and I have just uploaded KiokuDB::Backend::DBI version 1.11 and KiokuDB version 0.46.

These are major releases of both modules, and I will post at length on each of these new features in the coming days:

  • Caching live instances of immutable objects. For data models which favour immutability this should provide significant speedups with minimal code changes and no change in semantics.
  • Leak tracking is now in core. This was previously only available in Catalyst::Model::KiokuDB.
  • KiokuDB::Entry objects can be discarded after use to save memory (until now they were always kept around for as long as the object was still live)
  • Integration between KiokuDB managed objects and DBIx::Class managed rows, allowing for mixed relational/graph schemas as in this job queue example.

Friday, June 18, 2010

I hate software

A long standing bug in Directory::Transactional has finally been fixed.

Evidently, universally unique identifiers are only unique as long as the entire universe is contained within a single UNIX process, at least as far as e2fsprogs' libuuid is concerned.

These "unique" strings were used to create names for transaction work directories, so when they in fact turned out to be the same fucking strings across forks, the two processes would overwrite each others' private data.

uuid(3) doesn't even contain any information on how to reseed it, so I couldn't do that even if I bothered to check for forks myself.

I simply cannot fathom how a pseudorandom number generator is being used for such a library without taking forking into account. Isn't this stuff supposed to be reliable?
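
The same class of failure is easy to reproduce with Perl's own rand, since a forked child inherits the parent's generator state. The sketch below is not libuuid's code and not the fix that went into Directory::Transactional; the name_from_child helper and the txn- naming scheme are made up for the demonstration. It assumes a UNIX-like system with a working fork.

```perl
use strict;
use warnings;
use POSIX ();

srand(42);    # the parent seeds the generator once, before forking

# Fork a child, let it produce one "unique" name, and read it back.
sub name_from_child {
    my ($make_name) = @_;

    pipe( my $reader, my $writer ) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;

    if ( $pid == 0 ) {
        close $reader;
        print {$writer} $make_name->();
        close $writer;
        POSIX::_exit(0);    # skip END blocks and destructors in the child
    }

    close $writer;
    my $name = <$reader>;
    close $reader;
    waitpid( $pid, 0 );
    return $name;
}

# Relies on the inherited generator state only
my $naive = sub { sprintf "txn-%08d", int rand 100_000_000 };

# Mixes in the process ID, which differs between live processes
my $with_pid = sub { sprintf "txn-%d-%08d", $$, int rand 100_000_000 };

my @naive    = map { name_from_child($naive) } 1 .. 2;
my @with_pid = map { name_from_child($with_pid) } 1 .. 2;

print "naive:    @naive\n";
print "with pid: @with_pid\n";
```

Both naive names come out identical because neither child reseeds. Mixing in $$ or calling srand() in the child after the fork are the usual workarounds when the generator is under your control, which was not the case with libuuid.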