Thursday, November 26, 2009

The timing of values in imperative APIs

Option configuration is a classic example of when I prefer a purely functional approach. This post is not about broken semantics, but rather about the tension between ease of implementation and ease of use.

Given Perl's imperative heritage, many modules default to imperative option specification. This means that the choice of one behavior over another is represented by an action (setting the option), instead of a value.

Actions are far more complicated than values. For starters, they are part of an ordered sequence. Secondly, it's hard to know what the complete set of choices is, and it's hard to correlate between choices. And of course the actual values must still be moved around.

A simple example is Perl's built in import mechanism.

When you use a module, you are providing a list of arguments that passed to two optional method calls on the module being loaded, import and VERSION.

Most people know that this:

use Foo;

Is pretty much the same as this:

BEGIN {
    require Foo;
    Foo->import();
}

There's also a secondary syntax, which allows you to specify a version:

use Foo 0.13 qw(foo bar);

The effect is the same as:

BEGIN {
    require Foo;
    Foo->VERSION(0.13);
    Foo->import(qw(foo bar));
}

UNIVERSAL::VERSION is pretty simple, it looks at the version number and compares it with $Foo::VERSION and then complains loudly if $Foo::VERSION isn't recent enough.

But what if we wanted to do something more interesting, for instance adapt the exported symbols to be compatible with a certain API version?

This is precisely why VERSION is an overridable class method, but this flexibility is still very far from ideal.

my $import_version;

sub VERSION {
    my ( $class, $version ) = @_;

    # first verify that we are recent enough
    $class->SUPER::VERSION($version);

    # stash the value that the user specified
    $import_version = $version;
}

sub import {
    my ( $class, @import ) = @_;

    # get the stashed value
    my $version = $import_version;

    # clear it so it doesn't affect subsequent imports
    undef $import_version;

    ... # use $version and @imports to set things up correctly
}

This is a shitty solution because really all we want is a simple value, but we have to juggle it around using a shared variable.

Since the semantics of import would have been made more complex by adding this rather esoteric feature, the API was made imperative instead, to allow things to be optional.

But the above code is not only ugly, it's also broken. Consider this case:

package Evil;
use Foo 0.13 (); # require Foo; Foo->VERSION;

package Innocent;
use Foo qw(foo bar); # require Foo; Foo->import;

In the above code, Evil is causing $import_version to be set, but import is never called. The next invocation of import comes from a completely unrelated consumer, but $import_version never got cleared.

We can't use local to keep $import_version properly scoped (it'd be cleared before import is called). The best solution I can come up with is to key it in a hash by caller(), which at least prevents pollution. This is something every implementation of VERSION that wants to pass the version to import must do to be robust.

However, even if we isolate consumers from each other, the nonsensical usage use Foo 0.13 () which asks for a versioned API and then proceeds to import nothing, still can't be detected by Foo.

We have 3 * 2 = 6 different code paths[1] for the different variants of use Foo, one of which doesn't even make sense (VERSION but no import), two of which have an explicit stateful dependency between two parts of the code paths (VERSION followed by import, in two variants), and two of which have an implicit stateful dependency (import without VERSION should get undef in $import_version). This sort of combinatorial complexity places the burden of ensuring correctness on the implementors of the API, instead of the designer of the API.

It seems that the original design goal was to minimize the complexity of the most common case (use Foo, no VERSION, and import called with no arguments), but it really makes things difficult for the non default case, somewhat defeating the point of making it extensible in the first place (what good is an extensible API if nobody actually uses it to its full potential).

In such cases my goal is often to avoid fragmenting the data as much as possible. If the version was an argument to import which defaulted to undef people would complain, but that's just because import uses positional arguments. Unfortunately you don't really see this argument passing style in the Perl core:

sub import {
    my ( $class, %args ) = @_;

    if ( exists $args{version} ) {
        ...
    }
    ... $args{import_list};
}

This keeps the values together in both space and time. The closest thing I can recall from core Perl is something like $AUTOLOAD. $AUTOLOAD does not address space fragmentation (an argument is being passed using a a variable instead of an argument), but it at leasts solves the fragmentation in time, the variable is reliably set just before the AUTOLOAD routine is invoked.

Note that if import worked like this it would still be far from pure, it mutates the symbol table of its caller, but the actual computation of the symbols to export can and should be side effect free, and if the version were specified in this way that would have been easier.

This is related to the distinction between intention and algorithm. Think of it this way: when you say use Foo 0.13 qw(foo bar), do you intend to import a specific version of the API, or do you intend to call a method to set the version of the API and then call a method to import the API? The declarative syntax has a close affinity to the intent. On the other hand, looking at it from the perspective of Foo, where the intent is to export a specific version of the API, the code structure does not reflect that at all.

Ovid wrote about a similar issue with Test::Builder, where a procedural approach was taken (diagnosis output is treated as "extra" stuff, not really a part of a test case's data).

Moose also suffers from this issue in its sugar layer. When a Moose class is declared the class definition is modified step by step, causing load time performance issues, order sensitivity (often you need to include a role after declaring an attribute for required method validation), etc.

Lastly, PSGI's raison d'etre is that the CGI interface is based on stateful values (%ENV, globally filehandles). The gist of the PSGI spec is encapsulating those values into explicit arguments, without needing to imperatively monkeypatch global state.

I think the reason we tend to default to imperative configuration is out of a short sighted laziness[2]. It seems like it's easier to be imperative, when you are thinking about usage. For instance, creating a data type to encapsulate arguments is tedius. Dealing with optional vs. required arguments manually is even more so. Simply forcing the user to specify everything is not very Perlish. This is where the tension lies.

The best compromise I've found is a multilayered approach. At the foundation I provide a low level, explicit API where all of the options are required all at once, and cannot be changed afterwords. This keeps the combinatorial complexity down and lets me do more complicated validation of dependent options. On top of that I can easily build a convenience layer which accumulates options from an imperative API and then provides them to the low level API all at once.

This was not done in Moose because at the time we did not know to detect the end of a .pm file, so we couldn't know when the declaration was finished[3].

Going back to VERSION and import, this approach would involve capturing the values as best we in a thin import (the sugar layer), and passing them onwards together to some underlying implementation that doesn't need to worry about the details of collecting those values.

In my opinion most of the time an API doesn't actually merit a convenience wrapper, but if it does then it's easy to develop one. Building on a more verbose but ultimately simpler foundation usually makes it much easier to write something that is correct, robust, and reusable. More importantly, the implementation is also easier to modify or even just replace (using polymorphism), since all the stateful dependencies are encapsulated by a dumb sugar layer.

Secondly, when the sugar layer is getting in the way, it can just be ignored. Instead of needing to hack around something, you just need to be a little more verbose.

Lastly, I'd also like to cite the Unix philosophy, another strong influence on Perl: do one thing, and do it well[4]. The anti pattern is creating one thing that provides two features: a shitty convenience layer and a limited solution to the original problem. Dealing with each concern separately helps to focus on doing the important part, and of course doing it well ;-)

This post's subject matter is obviously related to another procedural anti-pattern ($foo->do_work; my $results = $foo->results vs my $results = $foo->do_work). I'll rant about that one in a later post.

[1]

use Foo;
use Foo 0.13;
use Foo qw(foo bar);
use Foo 0.13 qw(Foo Bar);
use Foo ();
use Foo 0.13 ();

and this doesn't even account for manual invocation of those methods, e.g. from delegating import routines.

[2] This is the wrong kind of laziness, the virtuous laziness is long term

[3] Now we have B::Hooks::EndOfScope

[4] Perl itself does many things, but it is intended to let you write things that do one thing well (originally scripts, though nowadays I would say the CPAN is a much better example)

5 comments:

Anonymous said...

use Foo 0.13 ();

Makes sense if you wanted to do use Foo (), but discovered 0.12 had an ugly bug.

nothingmuch said...

Yes, but not if VERSION affects import

Aristotle said...

I made essentially the same point in Expand then contract, though coming at it from a different direction.

nothingmuch said...

That perspective is a lot more abstract.

I have a draft post on it, similarly abstract, but I wanted to drive home some practical examples first =)

What I was going to focus on was the process of detecting what exactly should, in your terms, be expanded (stuff that is implied rather than explicitly present in the data), and where to draw the line between decorating something and transforming into a new structure.

I personally see the lack of an expand/contract approach as possibly the single biggest shortcoming in software today. The only type of software that has really embraced it is compilers (and even then, many don't).

I think that the cognitive process is really what matters, but that it doesn't just stop at strength reduction of the combinatorial complexity, there's also the question of how refined a model is, and how it allows you to think about the problem space better, producing much better results in the contract phase.

Aristotle said...

Yeah, I know. At the time, I was thinking about XML::Atom::SimpleFeed, which tries to do things like omit type="text" when it outputs a text construct (because text is the default type) by stringing together conditionals that check for all the conceivable variations of the input. I wrote that article after I realised that it would make the whole code much less finicky if I first expanded all input to full generality, with all default cases made explicit etc., and then in a second phase emitted output by applying the omissions etc. to that filled-in structure.

(And I am finally soon going to be refactoring the module based on that insight… unfortunately I had to shave the XML generation yak first, which no one on CPAN appears to have done a good job of.)