Saturday, May 9, 2009

OLTP vs. Reporting

A while back FriendFeed described how they successfully use MySQL as an opaque BLOB store.

Their model is very similar to the one used in KiokuDB's DBI backend.
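For the curious, wiring KiokuDB up to its DBI backend is roughly this much code (a minimal sketch: the SQLite DSN and the stored hash are just placeholders, and a real application would mostly store proper Moose objects):

    use strict;
    use warnings;
    use KiokuDB;

    # the DBI backend keeps each stored object as a serialized blob keyed by
    # an id, which is essentially the FriendFeed approach
    my $dir = KiokuDB->connect(
        "dbi:SQLite:dbname=objects.db",
        create => 1,    # create the backing table on first use
    );

    my $scope = $dir->new_scope;    # live object scope for this unit of work

    # store an arbitrary data structure without declaring any schema for it
    my $id  = $dir->store( { name => "some entity", tags => [ "blob", "store" ] } );
    my $obj = $dir->lookup($id);    # fetch it back by key later on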

However, if you look at the comments on that article, you'll see that they were totally wrong: they simply don't realize they will need to switch programming languages and start using OLAP very soon ;-)

Fortunately this holy war has been raging long enough that sensible conclusions have already been made. If you think CouchDB is the way of the future then check out MUMPS, or its slightly more modern descendant, GT.M. Sounds somewhat familiar, doesn't it?

As I see it, the distinction between reporting-oriented database usage and transaction-processing-oriented usage dictates which technology is appropriate to use. Duh, kinda obvious.

OLTP, or online transaction processing, generally involves very targeted queries using explicit lookup keys. The data relevant to a specific transaction is fetched by traversing from these starting points. This traversal usually can't be specified as a single query, nor does it involve scanning or filtering a collection to extract the relevant bits (both patterns are common for reporting, obviously). The relevant data can be found using joins in a normalized schema, or by walking references in a navigational database.
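To make that concrete, here are the two query shapes side by side using DBI (a sketch against a made-up orders schema, so the table and column names are only illustrative):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=app.db", "", "", { RaiseError => 1 } );

    # OLTP shape: start from an explicit key and traverse outward via joins,
    # touching only the handful of rows relevant to this transaction
    my $order_id = 42;
    my $order    = $dbh->selectrow_hashref(
        "SELECT o.id, o.placed_at, c.name AS customer
           FROM orders o JOIN customers c ON c.id = o.customer_id
          WHERE o.id = ?",
        undef, $order_id,
    );
    my $lines = $dbh->selectall_arrayref(
        "SELECT sku, quantity, price FROM order_lines WHERE order_id = ?",
        { Slice => {} }, $order_id,
    );

    # reporting shape: scan and aggregate over the whole collection
    my $report = $dbh->selectall_arrayref(
        "SELECT customer_id, COUNT(*) AS orders, SUM(total) AS revenue
           FROM orders
          WHERE placed_at >= ?
          GROUP BY customer_id",
        { Slice => {} }, "2009-01-01",
    );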

A concrete example is eBay's bidding process. The database activity is centered around the item being bid on. The related data that is fetched includes images, descriptions, the seller and their history, the bids made on the item, etc. The actual bidding involves inserting new bids that reference the bidder. I can't really imagine a need to use SQL aggregate queries to run an auction. Even if GROUP BY could be used, the data set is usually small enough that it would be just as simple to do it in memory in the frontend, and the un-aggregated data is probably needed anyway. The transitive closure of data affecting or affected by the transaction is usually quite small (you never bid on all the items in a given category, for instance).
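For what it's worth, the in-memory "aggregation" I have in mind is trivial once the (small) bid list for an item has already been fetched. A sketch with made-up data:

    use strict;
    use warnings;
    use List::Util qw(max);

    # bids for a single item, already fetched as part of the normal page load
    my @bids = (
        { bidder => "alice", amount => 10 },
        { bidder => "bob",   amount => 12 },
        { bidder => "alice", amount => 15 },
    );

    # the moral equivalent of GROUP BY bidder: highest bid per bidder
    my %highest;
    for my $bid (@bids) {
        $highest{ $bid->{bidder} } = max( $bid->{amount}, $highest{ $bid->{bidder} } // 0 );
    }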

The aforementioned comment assumes that FriendFeed will need aggregation features some time in the future, and that it's worth their effort right now to make sure they can use them, regardless of what their actual problems and needs really are. This is a classic case of where to apply YAGNI, and I think FriendFeed's programmers have done very well in that respect. Until they stand to make a profit by applying complex reports to their data, there is no reason to expend effort or a software budget in order to use a "real" RDBMS.

Since data usage in OLTP apps tends to have a high locality of reference, it's also easier to reason about data variance. Imagine trying to create a fully normalized relational schema for medical records. You need personal information, medical history, treatments, lookup tables for diseases and drugs (and drug interactions), allergies, an open ended correspondence system for referring to specialists, a table for each type of lab test, and a query across all of these tables just to get all the tests a person has had administered. This is just the tip of the iceberg, and the schema probably changes constantly while needing to retain decades' worth of data. I can easily see why MUMPS and similar systems were and still are used in the health care industry.

By opting for a navigational approach the data doesn't need to be fully normalized. The way in which you find the data can tell you a lot about what is in it and how to use it. Instead of a homogeneous table for test results, individual results are linked from relevant fields and can differ greatly from one another if necessary.
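A minimal sketch of what I mean, in KiokuDB terms (all the class and field names here are made up for illustration):

    use strict;
    use warnings;
    use KiokuDB;

    {
        package MedicalRecord;
        use Moose;
        has patient => ( is => 'ro', isa => 'Str' );
        has tests   => ( is => 'rw', isa => 'ArrayRef', default => sub { [] } );
    }

    {
        package BloodPanel;
        use Moose;
        has hemoglobin => ( is => 'ro', isa => 'Num' );
        has glucose    => ( is => 'ro', isa => 'Num' );
    }

    {
        package Biopsy;
        use Moose;
        has site     => ( is => 'ro', isa => 'Str' );
        has findings => ( is => 'ro', isa => 'Str' );
    }

    my $dir   = KiokuDB->connect( "dbi:SQLite:dbname=records.db", create => 1 );
    my $scope = $dir->new_scope;

    # test results of completely different shapes hang off the same record;
    # there is no shared "test results" table and no normalization step
    my $record = MedicalRecord->new(
        patient => "Jane Doe",
        tests   => [
            BloodPanel->new( hemoglobin => 13.5, glucose => 90 ),
            Biopsy->new( site => "skin", findings => "benign" ),
        ],
    );

    $dir->store( "patient:jane-doe" => $record );    # store under a known key

    # later: navigate from the record to whatever tests happen to be linked
    my $again = $dir->lookup("patient:jane-doe");
    printf "%s has %d test results on file\n", $again->patient, scalar @{ $again->tests };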

Those of you familiar with relational databases are probably aware that there are many discussions of why the relational model isn't appropriate for this sort of data organization.

It's obviously possible to do OLTP style data processing using a relational model, even if it isn't a natural fit. The example I gave is probably insane, but thankfully most applications have simpler models than medical records. However, the inverse generally does not hold so well. There are navigational databases with strong reporting features, but reports need to reach inside the data in some common way. This means you have to sacrifice both the opacity and the variance of the data.

As usual, it all boils down to tradeoffs. Are you willing to use relational DBs with navigational paradigms because you need to make heavy use of reporting features? Or are you willing to sacrifice reporting features because you want easier storage for your complex model? Or maybe you want to store certain data points in a relational way, but keep the domain objects and metadata in a schema-free system so that you can use both as appropriate?

In fact, this is precisely the direction we've been moving in at work. It's usually not worth the effort to store everything in a relational database; it's too much work to initialize the database for testing, set up the ORM classes for all that auxiliary data, and so on. We've usually resorted to instantiating this data on app startup from configuration files, but this makes it hard to link from configuration data to domain objects.

Now that we're using KiokuDB the effort is greatly reduced. When reporting is necessary (generally just one big table of data points), plain SQL works very well. We never need to fetch individual rows from the report table, so we don't have any ORM setup for that. The benefit is that we can store everything in the object storage without needing to think about it.
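Roughly, the split looks like this (a simplified sketch; the Sensor class, table, and column names are stand-ins, not our actual code):

    use strict;
    use warnings;
    use KiokuDB;
    use DBI;

    {
        package Sensor;    # hypothetical domain class
        use Moose;
        has name => ( is => 'ro', isa => 'Str' );
    }

    # complex domain objects live in KiokuDB's schema-free storage
    my $dir = KiokuDB->connect( "dbi:SQLite:dbname=objects.db", create => 1 );

    # flat data points live in an ordinary SQL table, purely for reporting
    my $dbh = DBI->connect( "dbi:SQLite:dbname=reports.db", "", "", { RaiseError => 1 } );
    $dbh->do("CREATE TABLE IF NOT EXISTS data_points (sensor TEXT, taken_at INTEGER, value REAL)");

    my $scope = $dir->new_scope;

    my $sensor    = Sensor->new( name => "thermometer 1" );
    my $sensor_id = $dir->store($sensor);    # the whole object graph, no schema

    # each measurement is one denormalized row, keyed by the KiokuDB id
    $dbh->do(
        "INSERT INTO data_points (sensor, taken_at, value) VALUES (?, ?, ?)",
        undef, $sensor_id, time(), 36.6,
    );

    # reporting stays plain SQL; we never load these rows through an ORM
    my ($avg) = $dbh->selectrow_array(
        "SELECT AVG(value) FROM data_points WHERE sensor = ?",
        undef, $sensor_id,
    );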

I think the case for either approach is well established (and has been for a long time). Picking the right combination on a per-project basis just makes sense. If you're a fundamentalist and have already chosen your "side" of this tradeoff, well, you can have fun with a "real RDBMS" all you want =)

6 comments:

Anonymous said...

Eh... In the MUMPS Wikipedia page you link to, there is the standard stab at Perl:

> This program sets a value of "x x" to a variable named x, and then launches an infinite recursive execution of x, resulting in stack overflow. At 13 characters, including spaces and an end-of-line mark, the first variant demonstrates that it can be as compact and obscure as such languages as Perl.

It seems that we are the standard for obscurity now.

Best regards,

nothingmuch said...

Yeah, the Wikipedia article is a little shoddy in many other ways too. It's obviously dated, and sometimes conveys some really weird opinions, but all in all it's still informative.

SamV said...

Tangram has supported this relational/hierarchical mix since 2005. The Burroughs/Unisys database, DMS-II, was built in 1972 around the concept.

We're always so proud of our inventions we think that no-one's thought of them before ... :)

nothingmuch said...

Erm, I'm not claiming to have invented this, only that it is a good idea.

John Zabroski said...

SamV,

I can't quite understand from the description of DMS-II what it actually does. In particular, the following quote is worrisome: "It was sometimes claimed that a 'Network Type' Database could be built using the "Manual Subset" capability. That was very risky, since it depended upon "user programs" to add and remove records in manual sub-set."


nothingmuch,

You make some good points, but ultimately your solution is to use Yet Another Database System. The Holy Grail is logical/physical schema separation for a variety of physical data storage needs. Some researchers are actually working on products today where you specify what operations you want to do on the data, and the database is automatically structured for you in a way that will optimize performance.

Jeremy Leader said...

I worked with the implementors of DMS-II at Burroughs and Unisys in the mid-to-late '80s; let's see if I can remember enough details to add to the discussion.

As I recall, DMS-II was originally an ISAM storage engine, though it was also sometimes described as a network/hierarchical hybrid. It was quite tightly integrated into COBOL, though it could also be used from other languages.

As the relational model became popular, an ANSI-standard SQL layer was built, mostly on top of existing DMS-II (though I think it required a few new features in the core engine as well). This layering is somewhat similar to the way MySQL is built on top of storage engines like MyISAM and InnoDB (though in MySQL's case, the storage engines didn't have an independent existence prior to MySQL).

Around the same time that the SQL layer was being built, there was also an effort to build something called a "Semantic Model" on top of DMS-II. It was essentially an object store (without user-defined methods). That is, it modeled IS-A relationships (inheritance), and HAS-A relationships (one-to-one, one-to-many, many-to-many) directly in the database.