Thursday, March 18, 2010

What is a mixed schema?

Yesterday's post was a technical one announcing that KiokuDB and DBIx::Class can now be used together on the same schema. What it didn't explain is what this is actually good for.

Most of the application development we do involves OLTP in one form or another. Some of the apps also do reporting on simple, highly regular data.

KiokuDB grew out of the need to simplify the type of task we do most often. For the reporting side we are still trying to figure out what we like best. For example, Stevan has been experimenting with Fey (not Fey::ORM) for the purely relational data.

This approach has been far superior to what we had done before: forcing a loosely constructed, polymorphic set of objects with no reporting requirements into a normalized relational schema optimized for reporting applications. There is also a new, worse alternative, which is to run aggregate reports on several million data points as in-memory objects in Perl ;-)

However, the two-pronged approach still has a major drawback: the two data sets are completely separate. There is no way to refer to data across the two sets without embedding knowledge about the database handles into the domain, which is tedious and annoying.

The new DBIx::Class integration bridges that gap.

Concrete Example #1: Mixing KiokuDB into a DBIC centric app

Oftentimes I would find myself making compromises about what sort of objects to put into a relational schema.

There is a tension between polymorphic graphs of objects and a normalized relational schema.

Suppose you're writing an image gallery application, and you decide to add support for YouTube videos. Obviously, YouTube videos should be treated as image objects in the UI: they should tile with the images, and you should be able to rearrange them, add captions/tags, post comments, etc.

This is precisely where polymorphism makes sense: you have two types of things being used in a single context, but with completely different representations. One is probably represented by a collection of files on disk (the original image, previews, thumbnails, etc.) and a table entry of metadata. The other is represented by an opaque string ID, and most of its functionality is derived by generating calls to a web service.
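
To make the polymorphism concrete, here is a minimal pure-Perl sketch of the two resource classes (the class names come from later in this post; the method name, URL format, and video ID are made up for illustration):

```perl
use strict;
use warnings;

package MyApp::Resource::Image;
sub new { my ( $class, %args ) = @_; bless {%args}, $class }

# backed by files on disk
sub thumbnail_url { my $self = shift; "/thumbs/" . $self->{file} }

package MyApp::Resource::YouTubeVideo;
sub new { my ( $class, %args ) = @_; bless {%args}, $class }

# backed by an opaque string ID and a web service
sub thumbnail_url {
    my $self = shift;
    "http://img.youtube.com/vi/" . $self->{video_id} . "/default.jpg";
}

package main;

# the gallery code treats both uniformly:
my @resources = (
    MyApp::Resource::Image->new( file => "cat.jpg" ),
    MyApp::Resource::YouTubeVideo->new( video_id => "xyz123" ),
);

print $_->thumbnail_url, "\n" for @resources;
```

The calling code never needs to know which representation it is holding; that is the whole point.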

How do you put YouTube videos into your image table? Do you add a type column? What about a resource table that has NULLable foreign keys to the image table and a NULLable video_id column? What about a blob column containing serialized information about the data?

With a mixed schema you could create a resource table that has a foreign key to the KiokuDB entries table. You could use the resources table for things like random selection, searches, keeping track of view counts, etc.
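
A sketch of what such a result class might look like, using the integration described in yesterday's post (the table and column names here are made up; only id and the KiokuDB-backed object column are essential):

```perl
package MyApp::Schema::Result::Resource;
use base qw(DBIx::Class);

# load the KiokuDB component so one column can point into the
# KiokuDB entries table
__PACKAGE__->load_components(qw(Core KiokuDB));

__PACKAGE__->table("resources");

# the relational columns carry the data you actually query on;
# "object" refers to the polymorphic KiokuDB-managed object
__PACKAGE__->add_columns(qw(id caption view_count object));
__PACKAGE__->set_primary_key("id");

# mark "object" as a KiokuDB reference column
__PACKAGE__->kiokudb_column("object");

1;
```

Random selection, searching, and view counting all run against the relational columns, while the polymorphic resource objects stay opaque.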

I'm going to assume that you're not really interested in running reports on which characters show up most often in the YouTube video IDs, or on the average length of image filenames, so that data can stay opaque without compromising any features in your application.

On a technical level this is similar to using a serialized blob column, or some combination of DBIx::Class::DynamicSubclass and DBIx::Class::FrozenColumns.

However, by using KiokuDB these objects become first class citizens in your schema, instead of some "extra" data that is tacked on to a data row. You get a proper API for retrieving and updating real graphs of objects, much more powerful and automatable serialization, a large number of standard modules that are supported out of the box, etc.

Perhaps most importantly, the encapsulation and notion of identity is maintained. You can share data between objects, and that data sharing is reflected consistently in memory. You can implement your MyApp::Resource::YouTubeVideo and MyApp::Resource::Image without worrying about mapping columns, or weird interactions with Storable. That, to me, is the most liberating part of using KiokuDB.

Concrete Example #2: Mixing DBIC into a KiokuDB centric app

On the other side of the spectrum (of our apps, anyway) you'll find data models that are just too complicated to put into a relational schema easily; there are mixed data types all over the place, complex networks of data (we've put trees, graphs, DAGs, and other structures, sometimes all in a single app), and other things that are incredibly useful for rapid prototyping or complicated processing.

This usually all works great until you need an aggregate data type at some point. That's when things fall apart. Search::GIN is not nearly as feature complete as I had hoped it would be by now; in fact, it's barely a draft of a prototype. The DBI backend's column extraction is a fantastically useful hack, but it's still just a hack at heart.

But now we can freely refer to DBIC rows and result sets from our OO schema, just like any other in-memory object, to help with these tasks.

One of our apps used a linked list to represent a changelog of an object graph, somewhat similar to Git's object store. After a few months of deployment, we got a performance report from a client: a specific page was taking about 30 seconds to load. It turned out that normally only the last few revisions had to be queried, but in this specific case a pathological data construction meant that over a thousand revisions were loaded from the database and had their data analyzed. Since this linked list structure is opaque, we were literally hitting the database thousands of times in a single request.

I ended up using a crude cache to memoize some of the predicates, which let us just skip directly to the revision that had to be displayed.
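
The idea behind that crude cache can be sketched in a few lines of pure Perl. This is not the app's actual code; the predicate here is a stand-in subroutine, and in the real system each uncached call meant a database round trip:

```perl
use strict;
use warnings;

my %seen;       # revision ID => cached predicate result
my $calls = 0;  # how many times the expensive check really ran

# stand-in for the expensive, database-backed predicate
my $expensive_predicate = sub {
    my $revision_id = shift;
    $calls++;
    return $revision_id % 2 == 0;    # dummy condition
};

# memoized wrapper: each revision is checked at most once
my $is_interesting = sub {
    my $revision_id = shift;
    $seen{$revision_id} //= $expensive_predicate->($revision_id);
    return $seen{$revision_id};
};

# walking the same five revisions twice only pays five times
$is_interesting->($_) for 1 .. 5, 1 .. 5;
print "predicate ran $calls times\n";    # prints: predicate ran 5 times
```

With the cache in place, repeated traversals of the same revisions can skip directly to the one that has to be displayed instead of re-querying every node along the way.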

With the new features in the DBI backend I could simply create a table of revision containers (I would still need to store revisions in KiokuDB, because there were about 6 different revision types), on which I could do the entire operation with one select statement.

Conceptually you can consider the DBIC result set as just an object oriented collection type. It's like any other object in KiokuDB, except that its data is backed by a much smarter representation than a serialized blob: the underlying data store understands it and can query its contents easily and efficiently. The drawbacks are that it requires some configuration and that it can only contain objects of the same data type, but these are very reasonable limitations; after all, we've been living with them for years.

It's all a bit like writing a custom typemap entry to better represent your data to the backend. In fact, this is pretty much exactly what I did to implement the feature ;-)

This still requires making the effort to define a relational schema, but only where you need it, and only for data that makes sense in a relational setting anyway. And it's probably less effort than writing a custom typemap to create a scalable/queryable collection type.


Though still far from perfect, I feel that this really brings KiokuDB to a new level of usefulness; you no longer need to drink the Kool-Aid and sacrifice a powerful tool and methodology you already know.

Even though DBIC is not everyone's tool of choice and has its own drawbacks, I feel it is by far the most popular Perl ORM for a reason, which is why I chose to build on it. However, there's no reason why this approach can't be used for other backend types.

Eventually I'd like to see similar typemaps emerge for other backends. For example, Redis' different data types, CouchDB's design documents and views, and Riak's MapReduce jobs and queries (Franck's backend is on GitHub) could all be reflected as "just objects" that can coexist with other data in a KiokuDB object graph.

Wednesday, March 17, 2010

KiokuDB ♡ DBIx::Class

I just added a feature to KiokuDB's DBI backend that allows freely mixing DBIx::Class objects.

This resolves KiokuDB's limitations with respect to sorting, aggregating and querying by letting you use DBIx::Class for those objects, while still giving you KiokuDB's flexible schema for everything else.

The first part of this is that you can refer to DBIx::Class row objects from the objects stored in KiokuDB:

my $dbic_object = $resultset->find($primary_key);

$dir->insert(
    some_id => Some::Object->new( some_attr => $dbic_object ),
);

The second half is that relational objects managed by DBIx::Class can specify belongs_to type relationships (i.e. an inflated column) to any object in the KiokuDB entries table:

my $row = $rs->create({ name => "blah", object => $anything });

say "Inserted ID for KiokuDB object: ",
    $row->get_column("object");

To set things up you need to tell DBIx::Class about KiokuDB:

package MyApp::Schema;
use base qw(DBIx::Class::Schema);

# load the KiokuDB schema component
# which adds the extra result sources
__PACKAGE__->load_components(qw(Schema::KiokuDB));

__PACKAGE__->load_namespaces;


package MyApp::Schema::Result::Foo;
use base qw(DBIx::Class);

# load the KiokuDB component:
__PACKAGE__->load_components(qw(Core KiokuDB));

# do the normal stuff
__PACKAGE__->table("foo");
__PACKAGE__->add_columns(qw(id name object));
__PACKAGE__->set_primary_key("id");

# setup a relationship column:
__PACKAGE__->kiokudb_column("object");

# connect both together
my $dir = KiokuDB->connect(
    dsn => "dbi:SQLite:dbname=blah",
    schema_proto => "MyApp::Schema",
);

my $schema = $dir->backend->schema;

# then you can do some work:
$dir->txn_do( scope => 1, body => sub {
    my $rs = $schema->resultset("Foo");
    my $obj = $rs->find($primary_key)->object;

    # ... work with $obj, $rs and $dir here ...
});

There are still a few missing features, and this is probably not production ready, but please try it out! A dev release, KiokuDB::Backend::DBI 0.11_01, will be out once I've documented it.

In the future I hope to match all of Tangram's features, enabling truly hybrid schemas. This would mean that KiokuDB could store objects in more than one table, with objects having any mixture of properly typed, normalized columns, opaque data BLOBs, or something in between (a bit like DBIx::Class::DynamicSubclass and DBIx::Class::FrozenColumns, but with more flexibility and less setup).

Sunday, March 14, 2010

git snapshot

I've just uploaded a new tool, git snapshot, which lets you routinely capture snapshots of your working directory and records them in parallel to your explicitly recorded history.

The snapshot revisions stay out of the way for the most part, but you can view them when you need to, for example with gitx refs/snapshots/HEAD.

For me this is primarily useful when I'm sketching out a new project and forgetting to commit anything. When working on a large patch I usually use git commit -a --amend -C HEAD fairly often, which in conjunction with git reflog provides similar safety. However, git snapshot is designed to work well in either scenario.

I have a crontab set up to use mdfind so that all directories with the red label are snapshotted once an hour.

Wednesday, March 3, 2010

KiokuDB Introduces Schema Versioning

I've just released KiokuDB version 0.37, which introduces class versioning.

This feature is disabled by default to avoid introducing errors to existing schemas[1]. To try it out pass check_class_versions => 1 to connect:

my $dir = KiokuDB->connect(
    dsn => ...,
    check_class_versions => 1,
);

To use this feature, whenever you make an incompatible change to a class, also change its $VERSION. When KiokuDB tries to load an object that was stored before the change was made, the version mismatch is detected (versions are compared only as strings; the values carry no numeric meaning).
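
The string comparison is worth emphasizing, since "0.1" and "0.10" would count as a mismatch. A minimal pure-Perl sketch of the check (the stored version value below is hypothetical, and the real detection happens inside KiokuDB, not in your code):

```perl
use strict;
use warnings;

package Foo;
our $VERSION = "0.02";    # bumped after an incompatible change

package main;

# version recorded in a previously stored entry (hypothetical)
my $stored_version  = "0.01";
my $current_version = $Foo::VERSION;

# KiokuDB compares the two as strings; any difference is a mismatch
if ( $stored_version ne $current_version ) {
    print "version mismatch: stored $stored_version, ",
        "class $current_version\n";
    # here KiokuDB would either throw an error or invoke an
    # upgrade handler registered for the stored version
}
```

Note that "0.02" ne "0.2" is true as well, so pick one formatting convention for $VERSION and stick with it.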

Without any configuration this mismatch will result in an error at load time, but the KiokuDB::Role::Upgrade::Handlers::Table role allows you to declaratively add upgrade handlers to your classes:

package Foo;
use Moose;

with qw(KiokuDB::Role::Upgrade::Handlers::Table);

use constant kiokudb_upgrade_handlers_table => {

    # we can mark versions as being equivalent in terms of their
    # data. 0.01 to 0.02 may have introduced an incompatible API
    # change, but the stored data should be compatible
    "0.01" => "0.02",

    # on the other hand, after 0.02 there may have been an
    # incompatible data change, so we need to convert
    "0.02" => sub {
        my ( $self, %args ) = @_;

        return $args{entry}->derive(
            class_version => our $VERSION, # up to date version
            data => ..., # converted entry data
        );
    },
};

For more details see the documentation, especially KiokuDB::TypeMap::Entry::MOP.

[1] In the future this might be enabled by default, but when data without any version information is found in the database, it is assumed to be up to date.