Thursday, June 11, 2009

BerkeleyDB::Manager

Another module I haven't talked about is BerkeleyDB::Manager, a convenience wrapper for BerkeleyDB.

The interface that BerkeleyDB exposes to Perl is pretty close to the C API, which can get very tedious.

For me the hardest part of using Berkeley DB was getting everything properly set up. For example, in order to get transaction support you must pass a number of flags to initialize the various subsystems correctly, open the database with the right options, create transactions using the environment, assign them to the database handles you want them to apply to, and then commit or roll back while checking errors everywhere. Compared to DBI this is torture!

All the configuration is done by twiddling bits using flag constants, and every call must be checked for errors manually. For instance, the proper way to atomically increment a value ($db{$key}++) is:

use BerkeleyDB;

# create an environment home directory manually
# ("or" rather than "||", which would bind to $home instead of to mkdir)
mkdir $home or die $!;

# instantiate a properly configured environment
my $env = BerkeleyDB::Env->new(
    -Home   => $home,
    -Flags  => DB_INIT_TXN|DB_INIT_LOG|DB_INIT_LOCK|DB_INIT_MPOOL|DB_CREATE,
) || die $BerkeleyDB::Error;

my $txn = $env->txn_begin || die $BerkeleyDB::Error;

# open a database using the environment
my $db = BerkeleyDB::Btree->new(
    -Env      => $env,
    -Filename => $db_name,
    -Flags    => DB_CREATE|DB_AUTO_COMMIT,
) || die $BerkeleyDB::Error;

# activate the transaction for a database handle
$txn->Txn($db);

# get a value
my $value;

if ( ( my $ret = $db->db_get( $key, $value ) ) != 0 ) {
    if ( $ret != DB_NOTFOUND ) {
        die $BerkeleyDB::Error;
    }
}

# update it atomically
if ( $db->db_put( $key, ( $value || 0 ) + 1 ) != 0 ) {
    die $BerkeleyDB::Error;
}

if ( $txn->txn_commit != 0 ) {
    die $BerkeleyDB::Error;
}

Berkeley DB is designed to be used on anything from embedded devices to replicated clusters storing terabytes of data. The price we pay for this is too many knobs. It supports a very large number of features, and most of them are optional and independent of each other: journalling, transactions with varying levels of ACID guarantees, locking concurrency, multiversioning concurrency, threading support, multiprocess support and replication, to name the big ones.

However, this tradeoff doesn't need to be made every single time, and many of the knobs don't even apply to Perl, either because they aren't exposed in the Perl bindings or because they just don't make sense in that environment.

BerkeleyDB::Manager is an object oriented wrapper for Berkeley DB's environment handles. It's basically a factory for database handles.

Out of the box it is configured with what I've come to expect from RDBMSs:

  • Multiprocess safe (locking is enabled by default)
  • Transactions, with autocommit
  • Automatic recovery on startup
  • Deadlock detection

All of these options are configurable as attributes, so you can tweak them if necessary. There are also a few options which are disabled by default (such as log_auto_remove).
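As a minimal sketch of tweaking one of these attributes, the snippet below turns on log_auto_remove at construction time. The temporary home directory and the database file name are made up for illustration, and reading the setting back through an accessor of the same name is an assumption about how the attribute is exposed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use BerkeleyDB::Manager;

# Throwaway environment home, for illustration only
my $home = tempdir( CLEANUP => 1 );

my $manager = BerkeleyDB::Manager->new(
    home            => $home,
    create          => 1,
    log_auto_remove => 1,    # disabled by default
);

# Hypothetical database name
my $db = $manager->open_db("example.db");

# Assumes the attribute has a reader with the same name
print "log_auto_remove: ", ( $manager->log_auto_remove ? "on" : "off" ), "\n";
```

Everything not mentioned in the constructor call keeps its default, so locking, transactions, recovery and deadlock detection stay enabled.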

BerkeleyDB::Manager also provides convenience wrappers, so the above code would look something like:

use BerkeleyDB;            # for DB_NOTFOUND
use BerkeleyDB::Manager;

my $manager = BerkeleyDB::Manager->new(
    home   => $path,
    create => 1,
);

my $db = $manager->open_db( $db_name );

$manager->txn_do(sub {
    my $value;

    if ( ( my $ret = $db->db_get( $key, $value ) ) != 0 ) {
        if ( $ret != DB_NOTFOUND ) {
            die $BerkeleyDB::Error;
        }
    }

    if ( $db->db_put( $key, ( $value || 0 ) + 1 ) != 0 ) {
        die $BerkeleyDB::Error;
    }
});

There is no convenience API for put or get since that would require wrapping everything (cursors and database handles). This might be added as an option later, but even though this part of the BDB API is definitely tedious, at least it's not hard to get right, so it isn't a major concern. Unfortunately the tie interface doesn't do error checking either, so it's scarcely a replacement for doing this in your own code.
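In the meantime, the status checks can be tucked into a couple of small helpers in your own code. This is only a sketch: checked_get and checked_put are hypothetical names, not part of BerkeleyDB or BerkeleyDB::Manager, and the unnamed temporary database at the end is just there to exercise them.

```perl
use strict;
use warnings;
use BerkeleyDB;

# Hypothetical helper: returns the stored value, or undef if the key is
# missing. Dies on any status other than success or DB_NOTFOUND.
sub checked_get {
    my ( $db, $key ) = @_;

    my $value;
    my $ret = $db->db_get( $key, $value );

    return $value if $ret == 0;
    return undef  if $ret == DB_NOTFOUND;

    die "db_get failed: $BerkeleyDB::Error";
}

# Hypothetical helper: stores a value, dying on any non-zero status.
sub checked_put {
    my ( $db, $key, $value ) = @_;

    $db->db_put( $key, $value ) == 0
        or die "db_put failed: $BerkeleyDB::Error";

    return;
}

# Exercise the helpers on a throwaway database (no -Filename given).
# This increment is not atomic on its own; it still needs a transaction.
my $db = BerkeleyDB::Hash->new( -Flags => DB_CREATE )
    or die $BerkeleyDB::Error;

my $count = checked_get( $db, "counter" ) || 0;
checked_put( $db, "counter", $count + 1 );
```

With helpers like these, the body of the txn_do block shrinks to the last two lines, and any failure still surfaces as an exception.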

It's still important to understand how BDB works, by reading the documentation and the C reference. It's really not that hard once you get used to it, and you don't need to remember what everything does, only that it exists. The only exception is log archival. Most people will be happy with log_auto_remove, but making that the default would make catastrophic recovery impossible, since that kind of recovery relies on log files that would already have been deleted.

2 comments:

Jeremiah said...

Berkeley DB seems to get overlooked often in favor of other databases, despite the fact that it seems to be ubiquitous. What do you see as its strengths and weaknesses?

nothingmuch said...

Its biggest weakness is that it requires more effort (learning and usage) and a certain degree of vendor lock-in to be used successfully.

If you're willing to make the investment to learn how to use it and to stick with it, it provides many more features and much greater flexibility than any other similar system (for instance, being able to make hot backups is nice and impressive).

Compared to an SQL database it's much simpler, so for certain types of data extraction it'll have much lower overhead, but if you need complex queries you have to program their execution manually (and naive algorithms will often perform worse than an SQL database).