Friday, May 29, 2009

Immutable Data Structures (cont.)

Blah blah, immutable is great, functionally pure is cool, but isn't it slow with all that copying? Well, duh, if you misuse it. Then again, so is anything else. I've already made the unsubstantiated claim that immutability leads to better code, so in this post I will try to focus on more measurable advantages such as performance.

In 2009 everyone seems to want to scale. Cloud this, cluster that, consistent hashing, and so on. I firmly believe you need to actually have a successful product in order to run into scaling problems. If you are lucky enough to have a real need for scaling then you're probably aware that aggressive caching is a pretty reliable way of improving scalability.

The challenge of caching is making sure the cached data you are using is still valid. If the data has been updated then data which depends on it is now out of date too. This means that updates need to clear (or update) the cache. If there are multiple caches for a single master then these write operations might need to be replicated to all the caches.

Obviously this is no longer a problem if the data you are trying to cache is immutable. The drawback is that the key must change each time the data is updated. One could argue that this is the same problem: we're still fetching data from the database (or a cache that is kept up to date) on each request. The difference lies in how much data we fetch and how costly or hard it is to fetch it or keep it up to date.

Suppose you're storing images. If you name each file based on the hash of its contents you've just created a stable ETag. Since the URL will encapsulate the ETag, it's valid forever. You can set Expires and Cache-Control to a time when robots will rule the earth. Duplicate files will be consolidated automatically, and there's no need to worry about generating sequence numbers so the data is easy to replicate and change in a distributed setup.
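
Here's roughly what that looks like in code. This is an untested sketch (the storage root, the helper name and the choice of SHA-1 are arbitrary), but it shows how little is involved:

use strict;
use warnings;

use Digest::SHA;
use File::Copy qw(copy);
use File::Spec;

# store a file under a name derived from its contents; duplicate
# uploads collapse into a single entry, and the name never changes
sub store_image {
    my ( $root, $path ) = @_;

    my $digest = Digest::SHA->new(1)->addfile($path)->hexdigest;

    my $target = File::Spec->catfile( $root, $digest );
    copy( $path, $target ) unless -e $target;

    # embed this in the URL; it doubles as a stable ETag
    return $digest;
}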

This can be much finer grained, too. For instance to cache parts of a complicated page you can use hashes or sequence numbers from the database as cache keys. Assuming the data is immutable you can do simple app level caching or use these keys to generate ESI identifiers. The implementation of the caching can be adapted very easily to meet your requirements, without requiring massive changes to the way you store your data. There are no issues with the data being out of sync in a single page if you switch to ESI due to scaling considerations, since everything has a stable identifier.

I previously used Git as an example of a project which gains efficiency by using immutability. In Git the only thing that is modified in place is references (branches). Since file and revision data is addressed by its contents (using its SHA1 hash), this 40 byte identifier (well, 20 actually) can be used to describe the full state of a branch no matter how large the data in the branch is. When you run git remote update the references are updated, and new revision data is copied only if necessary. You're still fetching data each time, but the update is very quick if only a few bytes per branch need to be downloaded, which is the case if nothing has changed. In contrast rsync, which synchronizes mutable data, needs to work a lot harder to compare state.

The guiding principle here is to shift the update operations upwards, towards the root of the data instead of near the leaves. The same techniques apply to web applications as well.

Propagating changes upwards also reduces transactional load on the database. If the data is keyed by its contents or by a UUID you don't need to generate sequence numbers synchronously, or worry about update contention. You can easily replicate and shard the bulk of the data without transactional semantics (all you need is Durability, the D in ACID), while keeping the fully transactional updates comparatively cheap due to their smaller size. Consistency is still guaranteed because once you can refer to immutable data, it's always consistent. If the transactional parts need to be distributed, eventual consistency is easier to achieve when the bulk of the data is only added, while destructive updates are kept small and simple.

Though this is not really relevant in Perl, there are also benefits for multithreaded code. In a low level language like C, where you can royally screw things up by accessing shared data without locking, immutable structures provide both speed (no need to lock) and safety (no deadlocks, no inconsistent state) when shared. Ephemeral inconsistency is very hard to reproduce, let alone fix. You obviously still need to take locks on mutable variables pointing to immutable structures, but the data inside the structure can be used lock free. Immutability is also a key part of why STM is so successful in Haskell. If most operations are pure and only a few TVars are susceptible to thread contention then the optimistic concurrency is usually optimal. The overhead of such a high level abstraction ends up being pretty low in practice.

Immutable models are also more resilient to bugs and data corruption. By distilling the model into its most succinct/normalized representation you can express things more clearly, while still easily supporting safe and clean denormalization, without needing to synchronize updates. The "core" data is authoritative, and the rest can be regenerated if needed (even lazily). Assuming you've written tests for your model you can be reasonably sure of it, too. There is no possibility of action at a distance or sensitivity to the ordering of actions. If the data is correct the first time it will stay correct, so bugs and oversights tend to get ironed out early on in the development of the model.

If you did make a mistake and allowed an invalid structure to be created it's usually also easier to recover the data. This is especially true if you make use of persistence, because you'll have more detail than just a summary of the cause and effect chain.

However, there is also a major concern to be aware of which many people don't anticipate: cleanup. Especially when privacy is an issue, assuming immutability might make things harder. The more you take advantage of data sharing the harder garbage collection becomes, especially in a content addressable keyspace. If you replicate and cache your data deleting it is still easier than updating it correctly, but it's not as easy as just writing it and forgetting about it.

Finally, I'd like to emphasize the distinction between a purely functional data structure and a simply immutable one. If an immutable object contains references to mutable data, directly or indirectly, it isn't pure. Most of these benefits assume purity. Impurities can sometimes take some of the pain out of making everything immutable by letting you cut corners, but it's tempting to go too far due to short sighted laziness and lose out on everything in the long run.

In the end there's usually an obvious choice between a mutable and an immutable design for a particular problem, but mutability seems to be the default, mostly for premature optimization reasons. Unless you are absolutely sure immutable data will be really slow and wasteful in the big picture, and you also know how much slower it will be, there's usually no good reason to prefer mutability.

In the next post on this subject I will go over some techniques for easily working with immutable data structures. Updating immutable data by hand can be a pain in the ass, but it rarely has to be done that way.

Update: a low level example from OCaml.

Monday, May 25, 2009

Become a Hero Plumber

Perl's reference counting memory management has some advantages, but it's easy to get cycle management subtly wrong, causing memory and resource leaks that are often hard to find.

If you know you've got a leak and you've narrowed it down then Devel::Cycle can be used to make sense out of things, and Test::Memory::Cycle makes it very easy to integrate this into unit tests.
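
Integrating it into a test is about as simple as it gets. A minimal (untested) example, where My::App::Thing stands in for whatever class you suspect:

use Test::More tests => 1;
use Test::Memory::Cycle;

use My::App::Thing; # stand-in for the class under suspicion

my $thing = My::App::Thing->new;

# fails with a description of the cycle if $thing (or anything it
# refers to) participates in a reference loop
memory_cycle_ok( $thing, "no cycles in My::App::Thing" );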

Harder-to-find leaks are usually the result of combining large components together. Reading through thousands of lines of dumps is pretty impractical; even eating colored mushrooms isn't going to help you much.

For instance, this is a classic way to accidentally leak the context object in Catalyst:

sub action : Local {
    my ( $self, $c ) = @_;

    my $object = $c->model("Thingies")->blah;

    $c->stash->{foo} = sub {
        $object->foo($c);
    };

    $c->forward("elsewhere");
}

That action will leak all the transient data created or loaded in every request. The cyclical structure is caused by $c being captured in a closure that is indirectly referred to by $c itself. The fix is to call weaken($c) in the body of the action.
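
Something like this should do it (untested, but the only change from the version above is the weaken call):

use Scalar::Util qw(weaken);

sub action : Local {
    my ( $self, $c ) = @_;

    my $object = $c->model("Thingies")->blah;

    # the closure can still use $c for the duration of the request,
    # but it no longer keeps the context alive after the request ends
    weaken($c);

    $c->stash->{foo} = sub {
        $object->foo($c);
    };

    $c->forward("elsewhere");
}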

This example is pretty obvious, but if the model was arguably cleaner and used ACCEPT_CONTEXT to parameterize on $c, the leak would be harder to spot.

In order to find these trickier leaks there are a few modules on the CPAN that can be very helpful, if you know how and when to use them effectively.

The first of these is Devel::Leak. The basic principle is very simple: it makes note of all the live SVs at a given point in your program, you let some code run, and then when that code has finished you can ensure that the count is still the same.

Devel::Leak is handy because it's fairly predictable and easy to use, so you can narrow down the source of the leak using a binary search quite easily. Unfortunately you can only narrow things down so far, especially if callbacks are involved. For instance the Catalyst example above would be hard to analyze since the data is probably required by the views. The smallest scope we can test is probably a single request.
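
Basic usage looks something like this (a sketch; run_request is hypothetical and stands for whatever exercises the code you suspect):

use Devel::Leak;

my $handle;
my $before = Devel::Leak::NoteSV($handle);

run_request(); # hypothetical: run one request, call the suspect code, etc.

my $after = Devel::Leak::CheckSV($handle);

warn "leaked ", $after - $before, " SVs\n" if $after > $before;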

Devel::Gladiator can be used to write your own more detailed Devel::Leak workalike. It lets you enumerate all the live values at a given point in time. Just be aware that the data structures you use to track leaks will also be reported.

Using Devel::Gladiator you can also find a list of suspicious objects and then analyze them with Devel::Cycle quite easily.
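
For example, something along these lines (untested; MyApp::Suspect is a made up class name):

use Devel::Gladiator qw(walk_arena);
use Devel::Cycle;
use Scalar::Util qw(blessed);

my $arena = walk_arena();

for my $sv ( @$arena ) {
    next unless blessed($sv) and $sv->isa("MyApp::Suspect");
    find_cycle($sv); # prints any cycles it finds
}

# the arena itself holds references to everything, so don't keep it around
undef $arena;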

Sometimes the data that is leaking is not the data responsible for the leak. If you need to find the structures which are pointing to a leaked value then Devel::FindRef can be very helpful. The hardest challenge is picking the right value to track, so that you can get a small enough report that you can make sense of it.

Devel::Refcount and Devel::Peek can be used to check the reference count of values, but remember to take into account all the references to a given value that are also on the stack. Just because the ref count is 2 for a value that's supposed to be referred to once does not mean that it's the root of a cyclical structure.

A more managed approach is using instance tracking in your leaked classes, ensuring that construction and destruction are balanced on the dynamic scope. You can do this manually for more accurate results, or you can use something like Devel::Events::Objects. I personally dislike Devel::Leak::Object because you have no control over the scope of the leak checking, but if you're writing a script then it might work for you.

Lastly, if you suspect you've found a leak then Data::Structure::Util is a rather blunt way of confirming that suspicion.

Sunday, May 24, 2009

Immutable Data Structures

I doubt anyone could make a case against simplicity in software. But all problems have some inherent complexity which cannot be removed, and everything we do will end up adding to it. I think that the real art of programming is finding a solution that adds the least amount of complexity possible. Immutable data structures are a good way of trading complexity in one place for complexity in another, but unfortunately they are not that popular in the Perl world.

Perl is a very rich language. It comes with a very wide spectrum of styles to choose from. This is very much a part of Perl's culture, too, and obviously the fact that there are many different language primitives means that the number of ways to combine them is much larger. But TIMTOWTDI is a mixed bag: there is usually not more than one good way to do something, and which one it is depends on the context.

People usually contrast this with the Zen of Python, that is, that there should be only one way to do something. I think a better counterexample is purely functional programming. Compared to Perl, Python is still a rich and complex language. Its one true way of doing anything is based on style and opinion; it isn't a real necessity inherent in the language's structure.

The gist of purely functional languages (with pure being the key word, not functional) is that data never changes. There are no update operations at all. Instead of modifying data in place, you must make a copy of it, overriding the values you want to change.

Of course, we can do this in Perl too:

use MooseX::Declare;

class Person {
    has name => (
        isa => "Str",
        is  => "ro",
    );
}

Suppose you have a Person object and you'd like to update the name attribute. The common way to do this would be:

$person->name("Hefetz");

However, since $person is an immutable object (the name attribute is marked ro) you would need to do something like this instead:

my $renamed = $person->clone(
    name => "Hefetz",
);
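
The Person class above doesn't actually define clone. One way to get it (sketched here with plain Moose instead of MooseX::Declare) is MooseX::Clone, which provides a clone method that accepts constructor style overrides:

package Person;
use Moose;

with qw(MooseX::Clone); # provides $object->clone(%overrides)

has name => (
    isa => "Str",
    is  => "ro",
);

1;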

Likewise, any object pointing at the person would have to be cloned to have its copy of $person replaced with $renamed, or it will not reflect the change.

This is not a common practice in Perl because it often seems easier to just update values in place, but in a purely functional way this is the only way to update data.

To a typical Perl programmer this would seem like a much more complicated way of handling the data, but it's actually a pretty powerful tradeoff. Though the extra copying is an added complexity, operations that take this immutable data as input have a much simpler set of rules to follow on how the data may be used.

One such benefit is never having to recompute anything. This is called referential transparency. It's the property that applying a function to the same arguments will always return the same result. I believe that the real benefit lies not in potential efficiency gains (though that is another plus), but in the fact that you can make assumptions about the data in your code, and these assumptions are pretty resistant to changes in the code as well.

Let's say you realize that a data field must suddenly become updatable. While there is an undeniable amount of effort involved in restructuring everything to support multiple copies of the field's container, once that's done it's done. That code will not come back to haunt you because of this change.

Instead of needing to update dependent values you recreate them in relation to the updated copy. The old data is still usable (and will always remain valid). This may sound like a lot of work, but it's much easier to keep one version of an algorithm, instead of two (one to create and another to update).

This may seem limiting at first, but once you get used to working this way it's an excellent tool for identifying which parts of the data should and shouldn't change (and if they should, when and how), allowing you to model the data with more clarity. In the long run this is a big win for simplicity. The code might be solving a complex problem, but it's simpler to adapt and reuse.

One thing to keep in mind is that just because you use immutability to your benefit it doesn't mean you need to use it all the time. Before this change in KiokuDB, if the store operation resulted in an error then some objects would be left registered in the live object set even though they hadn't actually been stored. Looking for and removing these objects would have been very hard, but obviously the right thing to do.

The easy solution was to treat the live object set as immutable for the duration of a single store operation. All the operations are made on a temporary buffer object. If everything completed successfully then the buffer's changes are committed to the live object set at the end. I lacked the foresight to do this from the start because the set of live objects is inherently mutable. The extra layer of indirection seemed like added complexity (every lookup on the live object set would have to check the buffer first), but in the end I think it was a definite improvement.

An example of a much bigger system leveraging immutability is Git. When writing a version control system you must obviously keep full history of the data. Most systems have approached this as a problem that needs to be overcome. Git is somewhat unique in that it takes advantage of this property in the model's design. By using a simple snapshot based model instead of a complicated delta based one it's able to provide better performance, safety, security and flexibility.

Casual observers initially criticised git for having a model so simple it was actually naive. It turns out they were confusing the model with its on disk representation. Git makes this distinction very well, and the result is that it implements powerful features (for instance idempotent patch application) which are apparently too complicated in other systems. Git itself isn't simple at all; the problem of version control is a complicated one, so any system dealing with it is inherently complex. Git's advantage is that it's built on a very simple and future proof core, allowing the complex parts to evolve more easily.

Anymoose, this post is getting pretty long, so I think I will put off more technical details for a later date. This should make a nice series of posts. I think it should also help me write my talk at YAPC::NA this year.

In a future post I'll probably try and discuss the implications (positive and negative) of using immutable data structures in real code.

If you want some good reading material for this topic, two of my favourite programming books are Purely Functional Data Structures and Algorithms: A Functional Approach. Both are surprisingly short but cover a lot of ground. A good introduction to this whole approach is the article Worlds: Controlling the Scope of Side Effects.

Friday, May 22, 2009

Devel::STDERR::Indent

The next module in the modules I haven't talked about series is Devel::STDERR::Indent. This is a simple utility module for indenting tracing output.

If you're doing high level tracing with warn then every call to warn invokes $SIG{__WARN__}. Devel::STDERR::Indent wraps this hook and indents the output according to its current level of nesting. This is especially handy for recursive code, where the same trace message is emitted for different parts of the flow.

To raise the indentation level you create a guard:

use Devel::STDERR::Indent qw(indent);

sub foo {
    my $h = indent();
    warn "in foo";
}

For as long as $h is in scope, the indentation level will be one level deeper.
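
Since the guard is scoped to the stack frame, this nests naturally in recursive code. A contrived sketch (the tree layout is made up):

use Devel::STDERR::Indent qw(indent);

sub visit {
    my $node = shift;

    my $h = indent(); # one level deeper for the duration of this frame

    warn "visiting $node->{name}";

    visit($_) for @{ $node->{children} || [] };
}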

Thursday, May 21, 2009

Graceful degrading of gists

I've written a small bit of jQuery based code to take <pre> tags with certain markers and upgrade them in place into pretty embedded gists. The idea is that the tag in the blog post is a <pre> instead of a <script>.

Unfortunately you may have noticed my blog was supposedly in violation of the TOS for the past few hours. I suspect this is due to adding this script. It has now been unflagged but if you'd like to use it be aware that Blogger didn't seem to like it much at first.

When you want to embed a gist paste in an HTML document you are provided with a <script> tag that has two document.write calls, one to add a stylesheet and the other to add a <div> tag for the actual paste. Usually you'd then want to complement this with a <noscript> tag containing a <pre> with a copy of the code so it displays in aggregators or systems with javascript disabled.

Here's how I embed a gist now:

<pre class="fake-gist" id="fake-gist-115368"><code>
use Moose;

has fun => ( isa => "Constant" );
</code></pre>

My script iterates all the elements of the class fake-gist, wraps them with the same styling as the gists, and then fetches the actual gist with syntax highlighting. When the gist has been fetched document.write is trapped and instead of appending the gist to the end of the document it calls a replaceWith on the <pre> tag.

This also makes the page load much faster, since the fetching of gists is no longer synchronous.

Feel free to download the script and get banned from Blogger yourselves. You'll also need to add github's embed.css and jQuery 1.2 or newer.

Wednesday, May 20, 2009

Thoughts on Wolfram Alpha

Wow, just 20 days into this blogging thing and already I'm writing wanker posts. I guess I understand how this can happen now. It's tempting to write this shit down because there's the hope that some random person will come in with a refreshing idea/response. I just hope it'll be better than "DIE IN A FIRE RTARD".

So anyway, the idea behind Wolfram Alpha is that it "understands" both your questions and the data it's using to answer them. The obviously disappointing thing is that it achieves this goal (or rather emulates it) using a Sisyphean process of manual input, as provided for by the admirable but ultimately very limited efforts of Wolfram Research. They aggregate the data from a variety of sources, but their focus is on quality, not quantity, so there is lots of human intervention. Once I got past the novelty of this shiny new toy, it quickly became quite boring. An impressive technical feat, but not earth shattering by any means. The types of data it knows about are rather arbitrary (anything "computable"), and though the various demos show an impressive amount of detail and structure, the big picture is both sparse and dull. I can't think of many interesting questions whose answer involves historical weather data or telling me what day of the week a certain date was. It sort of reminds me of savant syndrome. Answers to interesting questions require mental leaps, not just the retrieval of dry factual data.

I don't think economy of scale applies here, either. It's hard to imagine Wolfram Alpha being twice as interesting/useful (and thus potentially twice as profitable) by having twice as much data. A thousand times more data is where it's at. The project's downfall has already been foreshadowed; the semantic revolution has failed to happen for quite some time now, largely due to its reliance on manual annotations and relative uselessness on a small scale (billions of triples is still quite small). There is just too much unstructured data for us to go back and fix as a civilization. The web embodies a small subset of our potentially machine readable knowledge, and we've failed at that. The effort required to make Wolfram Alpha truly useful for getting information that cannot be found in traditional sources is colossal. Without being able to actually access all this wealth of information any automated system is still just an almanac with a search box, even if it's a very clever search box.

Comparatively, for a human, the task of extracting meaningful information from data is trivial. The reason we so want this technological breakthrough is that humans don't scale well. Wolfram Alpha's "understanding" is limited to what Wolfram Research's staff has fed into it. I don't believe they can succeed where so many others have failed without trying something radically different. They seem better motivated and goal oriented (having a product, as opposed to having a warm fuzzy feeling about the potential of machine readable data), but I don't think that this is enough for a real departure from the state of the art as of May 14th, 2009.

A slightly more interesting community driven project is Freebase. Freebase also automates input from various sources, but it is more lax, relying on continual improvement by the community. It also employs some clever tactics to make the process of improving the data fun. It doesn't have Wolfram Alpha's free form text queries, but I think it's more interesting because the data is open, editable and extensible. And yet my life still remains to be changed due to Freebase's existence.

I think the real answer will likely come from Google. How cliché, I know. But consider their translation services. By leveraging the sheer volume of data they have, they are able to use stochastic processes to provide better translation, spanning more language pairs than other systems. Supposedly "smart" systems hard coded with NLP constructs generally produce inferior results. So is Google Squared the future?

At least from what I've seen on the 'tubes Google Squared is not quite the holy grail either. It knows to find data about "things", and dices it up into standalone but related units of data using user fed queries. Google is encouraging the adoption of lightweight semantic formats such as RDFa and microformats, but I think the key thing is that Google Squared doesn't seem to rely on this data, only benefit from it. This difference is vital for a process of incremental improvement of the data we have. If it's already useful and we're just making it better by adopting these formats, we get to reap the benefits of semantic data immediately, even if the semantic aspects aren't perfect.

But the real interesting stuff is still much further off. Once Google Squared can provide richer predicates than "is", "has" or a vague "relates to" the set of predicates itself becomes semantic data. This is not a new idea. In fact, this concept is a core part of RDF's design (predicates are also subjects). What would be really interesting is to see if this data set, the meta model of the relationships between the "things" that Google Squared currently knows about, could be generated or at least expanded using data mining techniques from the data it describes. Imagine if you will the manual effort of choosing which data goes into Wolfram Alpha, and providing new ways of combining this data becoming automated.

Another key part of using stochastic processes to make something useful is integrating positive feedback. Google's ubiquity is a clear advantage here. Compared to Wolfram Alpha's offerings Google has orders of magnitude more data and orders of magnitude more potential for feedback to improve the way this data is processed.

There's also a financial reason for believing Google will make this happen. I hate advertisements because I don't want to buy most of that stuff. I see mostly ads that are wasting both my time and the advertisers' money. I think these sentiments are shared by many people, and yet Google makes a lot of its money by delivering slightly less annoying ads than its competitors. And Google is damn profitable. Within these imaginary technologies lies a lot of potential for ads that consumers would actually want to see. I think this pretty much guarantees an incentive for work in the field.

Anyway, here's hoping that semantic revolution will eventually happen after all. My money says that success will come from extracting new meaning from all that data, not by merely sifting through it for pieces that are already meaningful. Who knows, maybe it'll even happen in the next few decades ;-)

Modeling identity with KiokuX::User

If you're developing a KiokuDB based application then you probably already know about KiokuX::User. Here is a useful technique to keep your model flexible when you use it: instead of consuming the role in your high level user class, consume it in a class that models identity:


package MyFoo::Schema::Identity;
use Moose::Role;

has user => (
    isa      => "MyFoo::Schema::User",
    is       => "ro",
    required => 1,
);

package MyFoo::Schema::Identity::Username;
use Moose;

with qw(
    MyFoo::Schema::Identity
    KiokuX::User
);

MyFoo::Schema::User represents the actual user account, and any object doing the MyFoo::Schema::Identity role is an identity for such a user.

Keeping the two separated will allow you a number of freedoms:

  • Accounts can be consolidated or migrated easily
  • You can add support for additional authentication schemes
  • You can rename users but keep the usernames as primary keys (needed for backends for which uniqueness constraints cannot be specified)

Obviously you should also keep a set of identities in each user object.
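
That could look something like this (a sketch only; the attribute and helper names are arbitrary, and the set handling mirrors the KiokuDB::Util set() idiom used elsewhere on this page):

package MyFoo::Schema::User;
use Moose;

use KiokuDB::Set;
use KiokuDB::Util qw(set);

use namespace::autoclean;

has identities => (
    isa     => "KiokuDB::Set",
    is      => "ro",
    lazy    => 1,
    default => sub { set() }, # starts out empty
);

sub add_identity {
    my ( $self, $identity ) = @_;

    $self->identities->insert($identity);
}

1;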

Tuesday, May 19, 2009

Choosing an OpenID provider

Now that you have a pretty OpenID with the full freedom to change providers on a whim, here are a few tips for what to look for when choosing one.

In my opinion the most important issue is phishing protection. At least to me an SSO login is much more valuable than a throwaway password for a random site.

Since the site you log into redirects you to your OpenID server, the authentication sequence is vulnerable to phishing attacks. The malicious site could actually send you to a proxy page which will attempt to steal your credentials.

There are ways of protecting yourself as a user (simply never type your credentials into that page; always open a new window, log in there, and then reload the other), but it's easy to forget when you see a familiar looking page.

Some OpenID providers can display an image or text banner that depends on a cookie in your browser. That way if you don't see the image something suspicious is going on (the proxy will not receive the cookie from your browser).

Most providers offer better authentication methods. My current favourite is an SSL certificate. This means no sensitive information is sent over the wire at all. It's not only more secure, but also quicker and more convenient.

If you're concerned about security, make sure your provider has decent logging for all activity in your account.

The next thing to look for is multiple persona support. When you log into a website your provider will send some profile information along with the authentication token. If you want to use separate email or language settings for a certain website then your OpenID provider will need to allow you to pick which set of values to send.

Lastly, some OpenID providers and consumers are broken/out of date. It seems that ideally you'd want one that supports OpenID 2.0, but also version 1. I had quite a bit of trouble with Movable Type as an OpenID consumer, until I finally settled on myOpenID. I'm happy to say that it fulfills all my other requirements too.

Saturday, May 16, 2009

Github's Fork Queue

I keep my Perl code in Git, and I've been using Github to host it. Github significantly lowers the entry barrier to contributing to open source projects. I really love it so far. But I do have one problem: the only way to apply commits using its fork queue feature is actually the equivalent of running git cherry-pick --signoff.

When you use git cherry-pick it effectively clones the commit by reapplying the diff, and overrides some of the commit metadata. The commit you cherry-picked and the commit you end up applying become two separate histories as far as Git is concerned, even if a fast forward merge would have been possible.

When your contributor tries to sync up again, they will probably run git pull (I'll post some other time on why I think that's a bad habit), and end up merging their version of the patch with your version of the patch.

This can lead to very confusing history, especially when they want you to pull again and you have a mishmash of patches to sort through. Making things worse, many people who contribute using Github do so because it's so easy. By cherry picking their commits instead of just merging them you are making it hard for them; they will probably have to run git reset --hard origin/master to make things right.

So, in the interest of sanity, unless you actually mean to cherry pick, please always use git merge:


% git remote add some_user git://github.com/some_user/repo.git
% git remote update
Updating origin
Updating some_user
From git://github.com/some_user/repo
 * [new branch]      master   -> some_user/master
% gitx HEAD some_user/master
% git merge some_user/master
Updating decafbad..b00b1es
Fast forward
 lib/Some/File.pm |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

The gitx or gitk invocation lets you compare the two branches, viewing the commits you are about to merge before you actually apply them. I also really like using gitx --all to compare all the branches of a tree. The --left-right option is useful for comparing divergent branches. Note that all of these options are also usable with plain git log.

If you do insist on cherry picking, make sure to tell your contributor what to do afterwards (e.g. git reset --hard origin/master, possibly preceded by git branch rejected_work to keep their rejected patches in their own branch). Otherwise their unapplied work will continue to haunt them as they try to work on the new version.

Hopefully Github will implement a feature in the fork queue allowing you to fast forward to a revision, making it easy to handle this much more common case of merging.

Lastly, since I can't resist nitpicking, I'd also like to add that by convention Signed-off-by means (at least in the context of git and linux) that the committer is making some sort of guarantee about the copyright status of the contribution, either granting rights if they're the author, or claiming that they received the rights from the contributor whose patch they are applying. If you interpret Signed-off-by that way, then the fork queue is essentially forging a signoff from you every time you use it. It could be argued that it's meaningless, but then why do it at all? It seems the intent was really more like an Acked-by line. I think that adding a signoff should only be done if a checkbox that is disabled by default has been ticked by the user.

Your OpenID sucks

Now that OpenID is finally picking up I keep seeing people use lame URLs like http://username.myopenid.com/ to authenticate. This sucks because:

  • You are a unique snowflake!
  • It ties your identity to your OpenID provider.
  • It's only as permanent as your chosen provider (or your patience for it). You can't switch providers while keeping your existing ID.

Furthermore, this profile page usually just contains a single link forwarding to a user's homepage.

So instead of settling for an ugly URI, just use your existing homepage. There's no need to do any complicated set up or install OpenID software, because OpenID supports delegation natively.

Open up your OpenID provider's profile page and copy the OpenID related link and meta tags. On my myopenid page it looks like this:


<meta http-equiv="x-xrds-location" content="http://nothingmuch.myopenid.com/?xrds=1" />

<link rel="openid.server"    href="http://www.myopenid.com/server" />
<link rel="openid2.provider" href="http://www.myopenid.com/server" />

Paste that into your homepage, and add the following:


<link rel="openid.delegate" href="http://nothingmuch.myopenid.com/" />
<link rel="openid2.local_id" href="http://nothingmuch.myopenid.com/" />

Obviously the href of the delegate link should point to your own OpenID provider's profile page.

This lets me use a URL that is truly my own, http://nothingmuch.woobling.org/, as a fully functioning OpenID. I didn't have to install or configure anything. This also allows me to freely switch providers while retaining my chosen identity; all OpenID really needs to prove for authentication is that the user entering the URL is also in control of that URL, which makes providers swappable.

Setting up proper Yadis/XRDS discovery headers is left as an exercise for the user. I was lazy and only used a meta tag ;-)

Friday, May 15, 2009

Why I don't use CouchDB

CouchDB is lots of fun. It's really easy to install on a mac using the CouchDBX package. It comes with a nice web UI so you can play around with it straight away. It leverages REST and JSON to provide a simple API that you can use from virtually any language. It has a great transactional model which lets you have full ACID semantics in a very lightweight way. So why don't I use it? Well, several reasons. I'll try to skip the standard flaming I've heard on the 'tubes before. Here goes…

Views

The concept of CouchDB's views is actually very elegant. It's a purely functional map of the documents in the database. This means you can process the data any way you like using javascript, but CouchDB can make assumptions about data freshness, and safely cache and index the results to provide efficient queries and updates (at least in theory ;-).

Unfortunately you can only create a view from the original data; there is no way to create views whose input is other views. This means that you cannot do anything really interesting with values from multiple documents. You can aggregate data from several documents into buckets using the reduce functionality, but you can't process that data further.

This means that you have to live with the same limitations as SQL queries (the fact that they are non recursive, so they can't express transitive relationships), but you don't get the freedom to write queries ad hoc and have them execute efficiently (ad hoc views are supported, but there are no general purpose indexes).

The reduce functionality alleviates this somewhat, but personally I feel this is a bit of a kludge (reduce is really just a special case of map: map takes data and outputs data into buckets using a key, and reduce is a map whose input is the buckets produced by the previous pass).

The overview implies that CouchDB contains a port of Google's MapReduce framework, but the "real" MapReduce is much more flexible than CouchDB's implementation.

Replication

The replication subsystem is also heavily hyped, but it's hard to find details about the way it actually works. My understanding is that each conflicting version is kept in storage, but that one of these "wins" and becomes the default version of a document. This is rationalized in the CouchDB technical overview as follows:

The CouchDB storage system treats edit conflicts as a common state, not an exceptional one

If I understand correctly, since a conflict is not an error, without explicitly seeking out these conflicts you keep working with the "winner". From the user's point of view, if your application is not defensive about conflicts but the user decides to deploy it with replication, it could lead to apparent data loss (the data is still there, but not viewable in the application) and inconsistencies (if two different documents' "winners" have a conflicting assumption about the state of the database, without actually conflicting in the data, though if fully serializable transactions are used this might not be an issue).

In short, color me skeptical. The replication subsystem could be a useful start to building distributed apps, but there is still a lot of effort involved in doing something like that.

Out of the box replication support is useful for taking data sets home on your laptop as a developer, and being able to push changes back later. I see no compelling evidence for the claims about scalability and clustering.

To me this seems like a niche feature, not really relevant for most applications, but one in which significant effort was invested. The presence of a feature I don't quite care for doesn't really mean I shouldn't use something, but for a project which is still under heavy development this comes at the expense of more important features.

Latency

If I recall correctly CouchDB supports upwards of 2000 HTTP requests per second on commodity hardware, but this is only optimal if you have many concurrent dumb clients, whereas most web applications scale rather differently (a handful of server side workers, not thousands).

Even if you use non blocking clients the latency of creating a socket, connecting, requesting the data and waiting for it is very high. In KiokuDB's benchmarks CouchDB is the slowest backend by far, bested even by the naive plain file backend by a factor of about 2-3, and by the more standard backends (Berkeley DB, DBI) by a factor of more than 10. To me this means that when using KiokuDB with Berkeley DB backend I don't need to think twice about a request that will fetch several thousand objects, but if that request takes 5 seconds instead of half a second the app becomes unusable. Part of the joy of working with non linear schemas is that you can do more interesting things with tree and graph traversals, but performance must be acceptable. Not all requests need to fetch that many objects, but for the ones that do CouchDB is limiting.

If you have data dependencies, that is you fetch documents based on data you found in other documents, this can quickly become a bottleneck. If bulk fetching and view cascades were supported, a view that provides the transitive closure of all relevant data for a given document could be implemented by simply moving the graph traversal to the server side.

So even though CouchDB performs quite well when measuring throughput, it's quite hard to get low latency performance out of it. The simplicity gained by using HTTP and JSON is quickly overshadowed by the difficulties of using nonblocking IO in an event based or threaded client.

To be fair a large part of the problem is probably also due to AnyEvent::CouchDB's lack of support for the bulk document API's include_docs feature (is that a recent addition?). KiokuDB's object linker supports bulk fetching of entries, so this could have the potential to make performance acceptable for OLTP applications requiring slightly larger transient data sets. Update: this has since been added to AnyEvent::CouchDB. I will rerun my benchmarks and post the results in the comments tomorrow.
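
For reference, the bulk document API itself is just one POST to _all_docs with include_docs enabled. A rough sketch using plain LWP and JSON instead of AnyEvent::CouchDB (the database name and document ids are made up):

use strict;
use warnings;

use LWP::UserAgent;
use JSON;

my $ua = LWP::UserAgent->new;

# one round trip for many documents instead of one request per id
my $res = $ua->post(
    "http://localhost:5984/mydb/_all_docs?include_docs=true",
    "Content-Type" => "application/json",
    Content        => encode_json({ keys => [ "doc1", "doc2", "doc3" ] }),
);

my @docs = map { $_->{doc} } @{ decode_json( $res->decoded_content )->{rows} };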

No authentication or authorization

Authorization support could make a big performance difference for web applications. If the mechanisms to restrict access were in place the CouchDB backend could be exposed to the browser directly, removing the server side application code as a bottleneck.

The server side could provide the client with some trusted token allowing it to view (and possibly edit) only a restricted set of documents. There is lots of potential in the view subsystem for creating a flexible authorization framework.

This would also make CouchDB a serious platform for writing pure javascript applications, without needing a fully trusted sandbox environment. If all you need is CouchDB and static files then deploying your app would be a breeze.

LDAP authentication is on the roadmap, but authentication and authorization are really separate features, and there doesn't seem to be any work toward flexible access control yet.

Apparent lack of development focus

I guess I have no business complaining about this since I don't actually contribute code, but it seems to me like the focus of the team was to improve what already exists, instead of adding important missing features (or at least features I feel are important). This makes me pessimistic about having any of the issues I raised resolved.

When I was last on the IRC channel there were discussions of a second rewrite of the on disk BTree format. Personally I would much rather see feature completeness first. Rewriting the on disk format will probably not provide performance improvements an order of magnitude better than the current state, so I think it's more than acceptable to let those parts remain suboptimal until the API is finalized, for instance. CouchDB's performance was definitely more than acceptable when I was using it pre-rewrite, so this strikes me as a lack of pragmatism and priorities, especially when the project does have an ambitious roadmap.

Alternatives

We've been using the excellent Berkeley DB as well as SQLite and unfortunately MySQL for "document oriented storage", and all of these work very well. Connectivity support is fairly ubiquitous, and unlike CouchDB the APIs are already stable and complete.

Other alternatives worth exploring include MongoDB (which unfortunately lacks transactions), key/value pair databases (lots of these lately, many of them distributed), RDF triplestores, and XML databases.

One alternative I don't really consider viable is Amazon SimpleDB. It exhibits all of the problems that CouchDB has, but also introduces a complete lack of data consistency, and a much more complex API. Unless you need massive scaling with very particular data usage patterns (read: not OLTP) SimpleDB doesn't really apply.

I think the most important thing to keep in mind when pursuing schema free data storage is the "you are not google" axiom of scaling. People seem to be overly concerned about scalability without first having a successful product to scale. All the above mentioned technologies will go a long way both in terms of data sizes and data access rates, and by using a navigational approach to storing your data sharding can be added very easily.

Anyway, here's hoping CouchDB eventually matures into something that really makes a difference in the way I work. At the moment once the store and retrieve abstractions are in place there's nothing compelling me to try and use it over any other product, but it does show plenty of promise.

Wednesday, May 13, 2009

Devel::StringInfo

The next installment in the modules I haven't talked about series is Devel::StringInfo.

Devel::StringInfo collects information about a string to determine what encoding it is in, what other encodings it could be in, and what Unicode string it would be if reinterpreted as such.

Encoding confusion usually happens because Perl stupidly assumes the default encoding for all undecoded strings is Latin-1, so when combining a string of bytes which are valid UTF-8 data with a Unicode character string, the bytestring is decoded as Latin-1 instead of UTF-8 as most people expect. Since virtually any byte sequence is valid Latin-1 this is a silent conversion whose side effects are usually observed very far away. To make things worse, when printing out Unicode strings without an explicit conversion, they are encoded as UTF-8, which means the data will not survive a round trip.

miyagawa's Encode::DoubleEncodedUTF8 module can be used to work around this problem, but you are better off identifying the cause and fixing it.

By using Devel::StringInfo to gather information about your strings you can identify byte strings that should be decoded (is_utf8 is false, but the strings appear to have UTF-8 data in them). When concatenating suspect strings together test both the inputs and the resulting string.

Perl's Unicode handling is very confusing because of the relationship between ASCII, UTF-8 and ISO 8859-1 (Latin-1). These encodings all overlap for the bottom 127 code points, so unless you are using strings in a language other than English your code might be wrong but appear to be working correctly.

In my opinion the best solution is to always decode as early as possible, and encode as late as possible. binmode($fh, ":encoding(utf8)") (rather than the laxer ":utf8" layer) is handy for this (Update: see discussion in comments and also read this page on utf8 and PerlIO). Also try and keep your data encoded in either UTF-8 or ASCII. Any handling of other encodings should be clearly and obviously marked in the source code, and decoded into Unicode strings as early as possible. Usually you are not processing binary data. It's up to you to tell Perl which data is actually text.
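
In code that means something like this (a sketch; the file names are placeholders):

use strict;
use warnings;

use Encode qw(decode);

# decode as early as possible: bytes become character strings at the boundary
open my $in, "<", "input.txt" or die $!;
my $text = decode( "UTF-8", do { local $/; <$in> } );

# ... all further processing works on character strings ...

# encode as late as possible: only when the data leaves the program,
# here by letting an IO layer do the encoding
open my $out, ">", "output.txt" or die $!;
binmode( $out, ":encoding(UTF-8)" );
print { $out } $text;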

A few more notes:

  • Don't forget to use utf8 when you have UTF-8 encoded string literals in your source code. The default encoding for Perl source code is unfortunately Latin-1.
  • Read and understand perlunitut. There is no magical way to cargo cult something and end up with working code. You must know how Perl treats your data.
  • Remember that perl implicitly decodes when you combine string and binary data and implicitly encodes when you print to a filehandle.
  • encoding::warnings will warn you if you implicitly decode data from bytes to unicode characters, but you need to remember to use it anywhere you handle strings.
  • For more advice see Juerd's perluniadvice page.

Saturday, May 9, 2009

OLTP vs. Reporting

A while back FriendFeed described how they successfully use MySQL as an opaque BLOB store.

Their model is very similar to the one used in KiokuDB's DBI backend.

However, if you look at the comments, you'll see that they were totally wrong: they simply don't realize they will need to switch programming languages and start using OLAP very soon ;-)

Fortunately this holy war has been waging long enough that sensible conclusions have already been made. If you think CouchDB is the way of the future then check out MUMPS, or its slightly more modern descendant, GT.M. Sounds somewhat familiar, doesn't it?

As I see it the distinction between reporting based database usage vs. transaction processing based usage dictates which technology is appropriate to use. Duh, kinda obvious.

OLTP, or Online transaction processing generally involves very targeted queries using explicit lookup keys. The data relevant to a specific transaction is fetched by traversing from these starting points. This traversal usually can't be specified using a single query, nor does it involve any scanning or filtering of data from a collection to extract the relevant bits (both patterns are common for reporting, obviously). The relevant data can be found using joins in a normalized schema, or by walking references in a navigational database.

A concrete example is eBay's bidding process. The database activity is centered around the item being bid on. The related data that is fetched includes images, descriptions, the seller and their history, the bids made on the item, etc. The actual bidding involves data insertion of new bids with new references to the bidder. I can't really imagine a need to use SQL aggregate queries to run an auction. Even if GROUP BY could be used, the data set is usually small enough that it should be just as simple to do it in memory in the frontend, and the un-aggregated data is probably used anyway. The transitive closure of data affecting or affected by the transaction is usually quite small (you never bid on all the items in a given category, for instance).

The aforementioned comment assumes that FriendFeed will need aggregation features some time in the future, and that it's worth their effort right now to make sure they can use them, regardless of what their actual problems and needs really are. This is a classic case of where to apply YAGNI, and I think FriendFeed's programmers have done very well in that respect. Until they stand to make a profit by applying complex reports to their data, there is no reason to expend effort or a software budget in order to use a "real" RDBMS.

Since data usage in OLTP apps tends to have a high locality of reference, it's also easier to reason about data variance. Imagine trying to create a fully normalized relational schema for medical records. You need personal information, medical history, treatments, lookup tables for diseases and drugs (and drug interactions), allergies, an open ended correspondence system for referring to specialists, a table for each type of lab test and a query across all of these tables just to get all the tests a person has had administered. This is just the tip of the iceberg, and it probably constantly changes while needing to retain decades' worth of data. I can easily see why MUMPS and similar systems were and still are used in the health care industry.

By opting for a navigational approach the data doesn't need to be fully normalized. The way in which you find the data can tell you a lot about what is in it and how to use it. Instead of a homogeneous table for test results, individual results are linked from relevant fields and can differ greatly from one another if necessary.

Those of you familiar with relational databases are probably aware that there are many discussions of why the relational model isn't appropriate for this sort of data organization.

It's obviously possible to do OLTP style data processing using a relational model, even if it isn't a natural fit. The example I gave is probably insane, but thankfully most applications have simpler models than medical records. However, the inverse generally does not hold so well. There are navigational databases with strong reporting features, but reports need to reach inside the data in some common way. This means you have to sacrifice both the opacity and the variance of the data.

As usual, it all boils down to tradeoffs. Are you willing to use relational DBs with navigational paradigms because you need to make heavy use of reporting features? Or are you willing to sacrifice reporting features because you want easier storage for your complex model? Or maybe you want to store certain data points in a relational way, but keep the domain objects and metadata in a schema free system so that you can use both as appropriate?

In fact, this is precisely the direction we've been moving in at work. It's usually not worth the effort to store everything in a relational database; it's too much work to initialize the database for testing, set up the ORM classes for all that auxiliary data, and so on. We've usually resorted to instantiating this data on app startup from configuration files, but this makes it hard to link from configuration data to domain objects.

Now that we're using KiokuDB the effort is greatly reduced. When reporting is necessary (generally just one big table of data points), plain SQL works very well. We never need to fetch individual rows from the report table, so we don't have any ORM setup for that. The benefit is that we can store everything in the object storage, without needing to think about it.

I think the case for either approach is well established (and has been for a long time). Picking the right combination on a per project basis just makes sense. If you're a fundamentalist and have already chosen your "side" of this tradeoff, well you can have fun with a "real RDBMS" all you want =)

Friday, May 8, 2009

Deployment branches with Git

When you deploy code to different machines you often need to add little tweaks, such as adding or modifying configuration files.

If you use Git to manage your project you can create one deployment branch per server:


$ git clone git://my/project.git
$ cd project
$ git checkout -b $(hostname) origin/master

Then perform any changes you want and commit them.

When it's time to update the deployment run:


$ git pull --rebase

to rebase the branch against the new state of the upstream branch.

This will replay your per-machine changes on top of the new head revision, without needing to push them back into the main branch.

If you want to know what's going on before you actually pull, git remote update followed by git status will tell you the state of the deployment branch compared to the upstream.

Even nicer: we don't use origin/master as the remote branch, but rather some other stable branch, which we merge master into when we deploy. This simplifies the process of making minor fixes: you can do them on the stable branch and merge into master instead of doing them on the development branch and cherry picking into the stable branch.

Thursday, May 7, 2009

Directory::Transactional

I recently remembered miyagawa's 20 modules talk from YAPC::Asia::2008. At the end he asked that other CPAN authors give a similar talk about some of their modules. I think that instead of giving a talk I will try to write a series of posts.

Directory::Transactional is a module I wrote for KiokuDB's plain files backend. It provides full ACID guarantees (as long as you also use it for all your read operations too) on an arbitrary set of files.

The interface revolves around a handle through which you create transactions (txn_do), and open all file and directory handles. For example if you do my $fh = $h->openw("file.txt") then $fh will be a filehandle open for writing to a copy of file.txt in a shadow directory created for the current transaction.
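
In other words, usage looks roughly like this (an untested sketch; the constructor arguments are from memory, so double check the docs):

use Directory::Transactional;

my $h = Directory::Transactional->new( root => "data" );

$h->txn_do(sub {
    # writes go to a shadow copy under the transaction's work directory
    my $fh = $h->openw("counter.txt");

    print { $fh } 42;

    # if this block dies the shadow copy is discarded and counter.txt
    # is untouched; otherwise the change is committed at the end
});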

One cool feature is the auto commit implementation. By using Hash::Util::FieldHash::Compat we can track the lifetime of all returned resources. The first resource created outside of a transaction causes one to be opened. When the last resource goes out of scope the transaction is committed. Perl's reference counting can be a pain sometimes, but it also enables some really cool hacks.

The most fun I had writing this module was the test suite. On UNIX platforms the crash recovery stress test forks off a bunch of concurrent workers and then randomly issues a kill -9 every once in a while. Meanwhile a fixture loop is continually checking that the read values are always consistent. The test itself updates several "bank account" text files (each contains a number), and the fixture checks that the accounts are always balanced. The actual update has additional delays to make sure that the files are not updated in the same OS time slice, and there truly is lock contention.

One major limitation is that it doesn't detect deadlocks if you access files out of order. I've been toying with maintaining a lock table, but that seems like a lot of work. If you are running on HPUX the OS will detect flock deadlocks and return EDEADLK, causing the transaction to roll back. Another option is to use the global option for deadlock-prone code, which creates a single top level lock.

In the future I hope to steal File::Transaction::Atomic's atomic symlink swapping hack. This will allow readers to safely work with the files without using a lock (though they won't benefit from the isolation part of ACID).

Wednesday, May 6, 2009

Using KiokuDB in Catalyst applications

Using KiokuDB with Catalyst is very easy. This article sums up a few lessons learned from the last several apps we've developed at my workplace, and introduces the modules we refactored out of them.

Let's write an app called Kitten::Friend, in which kittens partake in a social network and upload pictures of vases they've broken.

We generally follow these rules for organizing our code:

  • Catalyst stuff goes under Kitten::Friend::Web::.
  • Reusable app model code goes under Kitten::Friend::Model::.
  • The actual domain objects go under Kitten::Friend::Schema::, for instance Kitten::Friend::Schema::Vase. Using DBIx::Class::Schema these would be the table classes.

Anything that is not dependent on the Catalyst environment (as much code as possible) is kept separate from it. This means that we can use our KiokuDB model with all the convenience methods for unit testing or scripts, without configuring the Catalyst specific bits.

Functionality relating to how the app actually behaves is put in the Schema namespace. We try to keep this code quite pure.

Glue code and helper methods go in the Model namespace. This separation helps us to refactor and adapt the code quite easily.

So let's start with the schema objects. Let's say we have two classes. The first is Kitten::Friend::Schema::Kitten:


package Kitten::Friend::Schema::Kitten;
use Moose;

use Kitten::Friend::Schema::Vase;

use KiokuDB::Set;

use KiokuDB::Util qw(set);

use namespace::autoclean;

with qw(KiokuX::User);    # provides 'id' and 'password' attributes

has name => (
    isa      => "Str",
    is       => "ro",
    required => 1,
);

has friends => (
    isa     => "KiokuDB::Set",
    is      => "ro",
    lazy    => 1,
    default => sub { set() },    # empty set
);

has vases => (
    isa     => "KiokuDB::Set",
    is      => "ro",
    lazy    => 1,
    default => sub { set() },
);

sub new_vase {
    my ( $self, @args ) = @_;

    my $vase = Kitten::Friend::Schema::Vase->new(
        owner => $self,
        @args,
    );

    $self->vases->insert($vase);

    return $vase;
}

sub add_friend {
    my ( $self, $friend ) = @_;

    $self->friends->insert($friend);
}

1;

I've used the KiokuX::User role to provide the Kitten object with some standard attributes for user objects. This will be used later to provide authentication support.

The second class is Kitten::Friend::Schema::Vase:


package Kitten::Friend::Schema::Vase;
use Moose;

use MooseX::AttributeHelpers;
use URI;

use namespace::autoclean;

has pictures => (
    metaclass => "Collection::Array",
    isa       => "ArrayRef[URI]",
    is        => "ro",
    default   => sub { [] },
    provides  => {
        push => "add_picture",
    },
);

has owner => (
    isa      => "Kitten::Friend::Schema::Kitten",
    is       => "ro",
    required => 1,
);

1;

Now let's write a unit test:


use strict;
use warnings;

use Test::More 'no_plan';

use KiokuX::User::Util qw(crypt_password);
use URI;

use ok 'Kitten::Friend::Schema::Kitten';

my $kitten = Kitten::Friend::Schema::Kitten->new(
    name     => "Snookums",
    id       => "cutesy843",
    password => crypt_password("luvt00na"),
);

isa_ok( $kitten, "Kitten::Friend::Schema::Kitten" );

is( $kitten->name, "Snookums", "name attribute" );

ok( $kitten->check_password("luvt00na"), "password check" );
ok( !$kitten->check_password("bathtime"), "bad password" );

is_deeply( [ $kitten->friends->members ], [ ], "no friends" );
is_deeply( [ $kitten->vases->members ], [ ], "no vases" );

my $vase = $kitten->new_vase(
    pictures => [ URI->new("http://icanhaz.com/broken_vase") ],
);

isa_ok( $vase, "Kitten::Friend::Schema::Vase" );

is_deeply( [ $kitten->vases->members ], [ $vase ], "new vase added" );

This test obviously runs completely independently of either Catalyst or KiokuDB.

The next step is to set up the model. We use KiokuX::Model as our model base class. The model class usually contains helper methods, like txn_do call wrappers and various other storage oriented tasks we want to abstract away. This way the web app code and scripts get a simpler API that takes care of as many persistence details as possible.


package Kitten::Friend::Model::KiokuDB;
use Moose;

extends qw(KiokuX::Model);

sub insert_kitten {
    my ( $self, $kitten ) = @_;

    my $id = $self->txn_do(sub {
        $self->store($kitten);
    });

    return $id;
}

1;

We can write a t/model.t to try this out:


use strict;
use warnings;

use Test::More 'no_plan';

use ok 'Kitten::Friend::Model::KiokuDB';

use Kitten::Friend::Schema::Kitten;
use KiokuX::User::Util qw(crypt_password);

my $m = Kitten::Friend::Model::KiokuDB->new( dsn => "hash" );

{
    my $s = $m->new_scope;

    my $id = $m->insert_kitten(
         Kitten::Friend::Schema::Kitten->new(
            name     => "Kitteh",
            id       => "kitteh",
            password => crypt_password("s33krit"),
        ),
    );

    ok( $id, "got an ID" );

    my $kitten = $m->lookup($id);

    isa_ok( $kitten, "Kitten::Friend::Schema::Kitten", "looked up object" );
}

Next up is gluing this into the Catalyst app itself. I'm assuming you generated the app structure with catalyst.pl Kitten::Friend::Web. Create Kitten::Friend::Web::Model::KiokuDB as a subclass of Catalyst::Model::KiokuDB:


package Kitten::Friend::Web::Model::KiokuDB;
use Moose;

use Kitten::Friend::Model::KiokuDB;

BEGIN { extends qw(Catalyst::Model::KiokuDB) }

has '+model_class' => ( default => "Kitten::Friend::Model::KiokuDB" );

1;

And then configure a DSN in your Web.pm or configuration file:

<Model KiokuDB>
    dsn dbi:SQLite:dbname=root/db
</Model>
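If you'd rather keep the configuration in Perl, the equivalent in Web.pm (using the same example DSN) is something like:


# in Kitten::Friend::Web, before ->setup is called
__PACKAGE__->config(
    "Model::KiokuDB" => {
        dsn => "dbi:SQLite:dbname=root/db",
    },
);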

That's all that's necessary to glue our Catalyst independent model code into the web app part.

Your model methods can be called as:


my $m = $c->model("kiokudb");

my $inner = $m->model;

$inner->insert_kitten($kitten);

In the future I might consider adding an AUTOLOAD method, but you can also just extend the model attribute of Catalyst::Model::KiokuDB to provide more delegations (currently it only delegates the KiokuDB::Role::API methods).
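For example, a delegation for insert_kitten could be added in the web model subclass, relying on Moose attribute extension (a sketch, assuming the attribute is named model as described above):


# in Kitten::Friend::Web::Model::KiokuDB
has '+model' => (
    handles => [qw(insert_kitten)],
);

With that in place, $c->model("kiokudb")->insert_kitten($kitten) works directly, without fetching the inner model first.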

If you'd like to use Catalyst::Plugin::Authentication, configure it as follows:


__PACKAGE__->config(
    'Plugin::Authentication' => {
        realms => {
            default => {
                credential => {
                    class         => 'Password',
                    password_type => 'self_check'
                },
                store => {
                    class      => 'Model::KiokuDB',
                    model_name => "kiokudb",
                }
            }
        }
    },
);

And then you can let your kittens log in to the website:


my $user = eval {
    $c->authenticate({
        id       => $id,
        password => $password,
    });
};

if ( $user ) {
    $c->response->body( "Hello " . $user->get_object->name )
}

Some bonus features of the Catalyst model:

  • It calls new_scope for you, once per request
  • It tracks leaked objects and reports them in Debug mode. Circular structures are a bit tricky to get right if you aren't used to them, so this is a big help.

Tuesday, May 5, 2009

Nordic Perl Workshop

Last month I attended the Nordic Perl Workshop.

It was definitely one of the best Perl conferences I've been to. Fun all around, great people, and well organized.

Perl people often joke about JIT slide writing. Well, this time I was a good boy and wrote my slides in the morning, way before my talk. What I did write at the last minute was the slide software. More on this soon.

I was actually productive during the hackathon (usually I'm only productive because of hackathons, afterwards). mst and I found and fixed some hairy Moose issues involving metaclass compatibility and the immutable code generation. I also ported Rakudo's MMD resolution algorithm over to MooseX::Types::VariantTable, so that rafl could finish this bit of awesome. Lastly, I had my hair set on fire by Ingy.

Sunday, May 3, 2009

Become a Git Junkie

Lots of Perl projects have been switching over to Git lately, most of them from Subversion.

Unfortunately this means that many people are using Git as if it were Subversion, and consequently they are missing out on a lot.

Here are a few pointers to help you get to know some of Git's more powerful features:

  1. Read Git for Computer Scientists, and if you want a more in depth followup, Git from the Bottom Up. These two articles will help you understand how Git works on the inside. Git's terminology will suddenly seem clear, and you'll easily understand how merges are represented, what git rebase does, and why Git does copy/move detection after the fact.
  2. Play around with low level commands. For instance, try creating a commit by hand by manipulating the index and using git commit-tree (see the sketch after this list).
  3. Get comfortable with git rev-parse. Git has a very powerful syntax for specifying revisions, and this syntax is used in virtually every command.
  4. Knowing how to specify revisions accurately will let you back out of mistakes with confidence using git reset --hard and git reflog.
  5. Get to know git push's refspec syntax, and how to get it to do exactly what you want. Once you're fixing mistakes in history you occasionally need to use the --force flag, but git push is a little more flexible than that =).
  6. Start using git rebase --interactive and git commit --amend. These two commands help you create a clean history. No more git commit -m "oops, forgot that other file..."
  7. Explore some of Git's more esoteric features, like submodules, git fast-import & git fast-export, the hook system, git filter-branch, and the .git/info/grafts file. It's fun and useful to figure out how these features work even if you don't need to use them right now.
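As a small taste of point 2, here's roughly what a hand-rolled commit looks like using only plumbing commands (the file name and message are made up, and this assumes you're on master):


$ echo "hello" > hello.txt
$ git update-index --add hello.txt            # stage the file directly in the index
$ tree=$(git write-tree)                      # write the index out as a tree object
$ commit=$(echo "hand-made commit" | git commit-tree $tree -p HEAD)
$ git update-ref refs/heads/master $commit    # point the branch at the new commit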

Friday, May 1, 2009

KiokuDB's First Year (give or take)

So, for lack of a better topic to talk about as my first post, I will orate about my latest large-ish project, KiokuDB.

There already is a fair amount of information about KiokuDB on the intertubes now. Most of it can be found from its project homepage (take a look at the talks or the architectural overview). I think that instead of explaining what it is, I will try to tell why and how this project came to be.

KiokuDB has very humble beginnings as a toy project by my coworker Jonathan Rockway, called MooseX::Storage::Directory. The idea was to use MooseX::Storage to serialize objects into YAML files and then fetch them back easily. Jon worked out a very cute API, but MooseX::Storage was just not powerful enough to really be useful for storing complex data.

I was very interested in writing a "proper" object database ever since I started programming in Perl, but never tried because it's such a difficult task. In fact, Perl already has two similar projects, Pixie and Tangram, both of which try to provide an OO focused approach to persistence (as opposed to say DBIx::Class which is truer to the relational model). Unfortunately neither of those was very popular, and I suspect the reason was skepticism; people just didn't believe the transparency would work reliably in a language as rich and crazy as Perl.

Given the way things were going with Moose over the last few years, I felt like it was a good time to reinvent that wheel once more, leveraging Moose for the transparency while keeping very conservative defaults elsewhere. Persistence is a hairy problem, but since Moose based classes have so much metadata it's much easier to do the right thing for objects of those classes.

In May of 2008 I started sketching out an initial design, re-reviewing the MooseX::Storage code, talking at length with Sam Vilain, and googling for similar projects. I wasn't doing any coding at all but ideas were materializing in my brain. In July Stevan, my boss, told me that he'd like to use this for our next $work app, leading to the first commit on KiokuDB.

By September we had a KiokuDB backed website running on the Berkeley DB backend. This site was doing simple queries and navigational presentation of the data. Our first impressions were that object databases are indeed much more natural to use for that kind of data. This project begat many KiokuDB related features, like the initial version of what would become Catalyst::Model::KiokuDB.

Since then we've developed four more applications. One of these makes heavy use of relational data, using Fey for "real" SQL (no OO inflation involved, lots of aggregate operations, etc.) and KiokuDB for everything else. Two other apps include a CAS-versioned schema, closely inspired by Git's versioning model.

Lately the project is also beginning to gather a community. KiokuDB powers Thumb-Rate.com and the Thumb-Rate app in the Apple iPhone Store. The #kiokudb IRC channel is also quite lively. In short, it's not just II that's using it for fun and profit ;-)

At least for us KiokuDB has been very successful so far. The amount of effort involved in prototyping apps has gone down, and the prototyping code evolves very easily into production quality code later, as features are needed. Schema changes amount to simply refactoring the object model. It has also reduced the number of ad hoc non-relational data stores. Using an ORM for simple configuration data is overkill, but with KiokuDB it's very natural.

That said, the project is still quite new, with many ideas left to explore. The biggest missing piece is probably Search::GIN, which has very ambitious goals but currently realizes only a handful of them.

Quite a long first post, I suppose. I guess I must be motivated ;-)

Blame mst

My name is Yuval, and I am a Perl hacker, at least as far as my brand new Internet Diary is concerned!

I'm fairly bad at sustaining habits, writing, and coming up with ideas to share, which probably means I will be an awful blogger. However, since so many of my fellow programmers are stepping up to the Iron Man challenge, I thought I'd give it a try as well. The worst that could happen is that mst's hair will not be colored purple with glitter.