Thursday, March 18, 2010

What is a mixed schema?

Yesterday's post is a technical one that says that KiokuDB and DBIx::Class can now be used together on the same schema. What it doesn't explain is what this is actually good for.

Most of the application development we do involves OLTP in one form or another. Some of the apps also do reporting on simple, highly regular data.

KiokuDB grew out of the need to simplify the type of task we do most often. For the reporting side we are still trying to figure out what we like best. For example Stevan has been experimenting with Fey (not Fey::ORM) for the purely relational data.

This approach has been far superior to what we had done before: forcing a loosely constructed, polymorphic set of objects with no reporting requirements into a normalized relational schema that's optimized for reporting applications. There is a also new, worse alternative, which is to run aggregate reports on several million data points as in memory objects with Perl ;-)

However, the two pronged approach still has a major drawback: the two data sets are completely separate. There is no way to refer to data in the two sets without embedding knowlege about the database handles into the domain, which is tedious and annoying.

What the new DBIx::Class integration allows is to bridge that gap.

Concrete Example #1: Mixing KiokuDB into a DBIC centric app

Often times I would find myself making compromises about what sort of objects I put into a relational schema.

There is a tension between polymorphic graphs of objects and normalizd relational schema.

Suppose you're writing an image gallery application, and you decide to add support for YouTube videos. Obviously YouTube videos should be treated as image objects in the UI, they should tile with the images, you should be able to rearrange them, add captions/tags, post comments, etc.

This is precisely where polymorphism makes sense, you have two types of things that being used in a single context, but with a completely different representation. One is probably represented by a collection of files on disk, for the original image, previews, thumbnails, etc, and table entry of metadata. The other is represented by an opaque string ID, and most of its functionality is derived by generating calls to a web service.

How do you put YouTube videos into your image table? Do you add a type column? What about a resource table that has NULLable foreign keys to the image table and a NULLable video_id column? What about a blob column containing serialized information about the data?

With a mixed schema you could create a resource table that has a foreign key to the KiokuDB entries table. You could use the resources table for things like random selection, searches, keeping track of views counts, etc.

I'm going to assume that you're not really interested on running reports on which characters show up most often in the YouTube video IDs or what is the average length of image filenames, so that data can be opaque without compromising any features in your application.

On a technical level this is is similar to using a serialized blob column approach, or some combination of DBIx::Class::DynamicSubclass and DBIx::Class::FrozenColumns.

However, by using KiokuDB these objects become first class citizens in your schema, instead of some "extra" data that is tacked on to a data row. You get a proper API for retrieving and updating real graphs of objects, much more powerful and automatable serialization, a large number of standard modules that are supported out of the box, etc.

Perhaps most importantly, the encapsulation and notion of identity is maintained. You can share data between objects, and that data sharing is reflected consistently in memory. You can implement your MyApp::Resource::YouTubeVideo and MyApp::Resource::Image without worrying about mapping columns, or weird interactions with Storable. That, to me, is the most liberating part of using KiokuDB.

Concrete Example #2: Mixing DBIC into a KiokuDB centric app

On the other side of the spectrum (of our apps, anyway) you'll find data models that are just too complicated to put into a relational schema easily; there are mixed data types all over the place, complex networks of data (we've put trees, graphs, DAGs, and other structures, sometimes all in a single app), and other things that are incredibly useful for rapid prototyping or complicated processing.

This usually all works great until you need an aggregate data type at some point. That's when things fall apart. Search::GIN is not nearly as feature complete as I hoped it would be right now, in fact, it's barely a draft of a prototype. The DBI Backend's column extraction is a fantastically useful hack, but it's still just a hack at heart.

But now we can freely refer to DBIC rows and resultsets just like we can in memory, from our OO schema, to help with these tasks.

One of our apps used a linked list to represent a changelog of an object graph, somewhat similarly to Git's object store. After a few months of deployment, we got a performance issue from a client, a specific page was taking about 30 seconds to load. It turned out that normally only the last few revisions had to be queried, but on that specific cases a pathological data construction meant that over a thousand revisions were loaded from the database and had their data analyzed. Since this linked list structure is opaque, this was literally hitting the database thousands of times in a single request.

I ended up using a crude cache to memoize some of the predicates, which let us just skip directly to the revision that had to be displayed.

With the new features in the DBI backend I could simply create a table of revision containers (I would still need to store revisions in KiokuDB, because there were about 6 different revision types), on which I could do the entire operation with one select statement.

Conceptually you can consider the DBIC result set as just an object oriented collection type. It's like any other object in KiokuDB, except that its data is backed by a much smarter representation than a serialized blob, the underlying data store understands it and can query its contents easily and efficiently. The drawback is that it requires some configuration, and it can only contain objects of the same data type, but these are very reasonable limitations, after all we've been living with them for years.

It's all a bit like writing a custom typemap entry to better represent your data to the backend. In fact, this is pretty much exactly what I did to implement the feature ;-)

This still requires making the effort to define a relational schema, but only where you need it, and only for data that make sense in a relational setting anyway. And it's probably less effort than writing a custom typemap to create a scalable/queriable collection type.


Though still far from perfect, I feel that this really brings KiokuDB into a new level of usefulness; you no longer need to drink the kool aid and sacrifice a powerful tool and methodology you already know.

Even though DBIC is not everyone's tool of choice and has its own drawbacks, I feel that is by far the most popular Perl ORM for a reason, which is why I chose to build on it. However, there's no reason why this approach can't be used for other backend types.

Eventually I'd like to be able to see similar typemaps emerge for other backends. For example the Redis backend could support Redis' different data types, CouchDB has design documents and views, and riak's MapReduce jobs and queries (Franck's backend is on GitHub) could all be reflected as "just objects" that can coexist with other data in a KiokuDB object graph.

No comments: