Thursday, November 26, 2009

The timing of values in imperative APIs

Option configuration is a classic example of when I prefer a purely functional approach. This post is not about broken semantics, but rather about the tension between ease of implementation and ease of use.

Given Perl's imperative heritage, many modules default to imperative option specification. This means that the choice of one behavior over another is represented by an action (setting the option), instead of a value.

Actions are far more complicated than values. For starters, they are part of an ordered sequence. Secondly, it's hard to know what the complete set of choices is, and it's hard to correlate between choices. And of course the actual values must still be moved around.

A simple example is Perl's built in import mechanism.

When you use a module, you are providing a list of arguments that is passed to two optional method calls on the module being loaded, import and VERSION.

Most people know that this:

use Foo;

is pretty much the same as this:

BEGIN {
    require Foo;
    Foo->import();
}

There's also a secondary syntax, which allows you to specify a version:

use Foo 0.13 qw(foo bar);

The effect is the same as:

BEGIN {
    require Foo;
    Foo->VERSION(0.13);
    Foo->import(qw(foo bar));
}

UNIVERSAL::VERSION is pretty simple: it takes the requested version number, compares it with $Foo::VERSION, and then complains loudly if $Foo::VERSION isn't recent enough.
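Roughly speaking, it does something like this (a simplified sketch; the real implementation also copes with version objects and v-strings):

sub VERSION {
    my ( $class, $want ) = @_;

    no strict 'refs';
    my $have = ${"${class}::VERSION"};

    die "$class does not define \$${class}::VERSION--version check failed"
        unless defined $have;

    die "$class version $want required--this is only version $have"
        if $have < $want;

    return $have;
}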

But what if we wanted to do something more interesting, for instance adapt the exported symbols to be compatible with a certain API version?

This is precisely why VERSION is an overridable class method, but this flexibility is still very far from ideal:

my $import_version;

sub VERSION {
    my ( $class, $version ) = @_;

    # first verify that we are recent enough
    $class->SUPER::VERSION($version);

    # stash the value that the user specified
    $import_version = $version;
}

sub import {
    my ( $class, @import ) = @_;

    # get the stashed value
    my $version = $import_version;

    # clear it so it doesn't affect subsequent imports
    undef $import_version;

    ... # use $version and @import to set things up correctly
}

This is a shitty solution because really all we want is a simple value, but we have to juggle it around using a shared variable.

Since the semantics of import would have been made more complex by adding this rather esoteric feature, the API was made imperative instead, to allow things to be optional.

But the above code is not only ugly, it's also broken. Consider this case:

package Evil;
use Foo 0.13 (); # require Foo; Foo->VERSION(0.13);

package Innocent;
use Foo qw(foo bar); # require Foo; Foo->import(qw(foo bar));

In the above code, Evil is causing $import_version to be set, but import is never called. The next invocation of import comes from a completely unrelated consumer, but $import_version never got cleared.

We can't use local to keep $import_version properly scoped (it'd be cleared before import is called). The best solution I can come up with is to key it in a hash by caller(), which at least prevents pollution. This is something every implementation of VERSION that wants to pass the version to import must do to be robust.
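Here's a sketch of that caller() keyed variant (still clunky, but at least well behaved):

my %import_version;

sub VERSION {
    my ( $class, $version ) = @_;

    # first verify that we are recent enough
    $class->SUPER::VERSION($version);

    # stash the version under the calling package, so that unrelated
    # consumers can't clobber each other
    $import_version{ scalar caller } = $version;
}

sub import {
    my ( $class, @import ) = @_;

    # fetch and clear the entry for this particular caller
    my $version = delete $import_version{ scalar caller };

    ... # use $version and @import to set things up correctly
}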

However, even if we isolate consumers from each other, the nonsensical usage use Foo 0.13 (), which asks for a versioned API and then proceeds to import nothing, still can't be detected by Foo.

We have 3 * 2 = 6 different code paths[1] for the different variants of use Foo, one of which doesn't even make sense (VERSION but no import), two of which have an explicit stateful dependency between two parts of the code paths (VERSION followed by import, in two variants), and two of which have an implicit stateful dependency (import without VERSION should get undef in $import_version). This sort of combinatorial complexity places the burden of ensuring correctness on the implementors of the API, instead of the designer of the API.

It seems that the original design goal was to minimize the complexity of the most common case (use Foo, no VERSION, and import called with no arguments), but it really makes things difficult for the non default case, somewhat defeating the point of making it extensible in the first place (what good is an extensible API if nobody actually uses it to its full potential?).

In such cases my goal is often to avoid fragmenting the data as much as possible. If the version were an argument to import that defaulted to undef, people would complain, but that's just because import uses positional arguments. Unfortunately you don't really see this argument passing style in the Perl core:

sub import {
    my ( $class, %args ) = @_;

    if ( exists $args{version} ) {
        ...
    }
    ... $args{import_list};
}
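Under that hypothetical calling convention, the use line would desugar into a single call that keeps the version and the import list together:

# hypothetical equivalent of: use Foo 0.13 qw(foo bar);

BEGIN {
    require Foo;
    Foo->import( version => 0.13, import_list => [qw(foo bar)] );
}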

This keeps the values together in both space and time. The closest thing I can recall from core Perl is something like $AUTOLOAD. $AUTOLOAD does not address the fragmentation in space (an argument is being passed using a variable instead of an argument), but it at least solves the fragmentation in time: the variable is reliably set just before the AUTOLOAD routine is invoked.

Note that if import worked like this it would still be far from pure (it mutates the symbol table of its caller), but the actual computation of the symbols to export can and should be side effect free, and if the version were specified in this way that would have been easier.

This is related to the distinction between intention and algorithm. Think of it this way: when you say use Foo 0.13 qw(foo bar), do you intend to import a specific version of the API, or do you intend to call a method to set the version of the API and then call a method to import the API? The declarative syntax has a close affinity to the intent. On the other hand, looking at it from the perspective of Foo, where the intent is to export a specific version of the API, the code structure does not reflect that at all.

Ovid wrote about a similar issue with Test::Builder, where a procedural approach was taken (diagnostic output is treated as "extra" stuff, not really a part of a test case's data).

Moose also suffers from this issue in its sugar layer. When a Moose class is declared, the class definition is modified step by step, causing load time performance issues, order sensitivity (often a role can only be applied after declaring an attribute, so that required method validation can see the generated accessors), etc.

Lastly, PSGI's raison d'être is that the CGI interface is based on stateful values (%ENV, global filehandles). The gist of the PSGI spec is encapsulating those values into explicit arguments, without needing to imperatively monkeypatch global state.

I think the reason we tend to default to imperative configuration is a short sighted laziness[2]. It seems like it's easier to be imperative when you are thinking about usage. For instance, creating a data type to encapsulate arguments is tedious. Dealing with optional vs. required arguments manually is even more so. Simply forcing the user to specify everything is not very Perlish. This is where the tension lies.

The best compromise I've found is a multilayered approach. At the foundation I provide a low level, explicit API where all of the options are required all at once, and cannot be changed afterwards. This keeps the combinatorial complexity down and lets me do more complicated validation of dependent options. On top of that I can easily build a convenience layer which accumulates options from an imperative API and then provides them to the low level API all at once.
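As a sketch of that approach (all of these package and method names are hypothetical), the low level API takes every option up front and validates them together, while the sugar layer merely accumulates values to pass along:

package Foo::Config;

sub new {
    my ( $class, %opts ) = @_;

    # every option is known here, so dependent options can be
    # validated together, in one place
    die "'version' requires an import list"
        if defined $opts{version} and not $opts{import_list};

    return bless {%opts}, $class; # immutable from here on
}

package Foo::Config::Builder;

sub new { bless { opts => {} }, shift }

# the imperative convenience layer just accumulates values...
sub version { $_[0]{opts}{version} = $_[1] }
sub imports { push @{ $_[0]{opts}{import_list} ||= [] }, @_[ 1 .. $#_ ] }

# ...and hands them to the low level API all at once
sub finalize { Foo::Config->new( %{ $_[0]{opts} } ) }

When the builder gets in the way, you just call Foo::Config->new directly.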

This was not done in Moose because at the time we did not know how to detect the end of a .pm file, so we couldn't know when the declaration was finished[3].

Going back to VERSION and import, this approach would involve capturing the values as best we can in a thin import (the sugar layer), and passing them onwards together to some underlying implementation that doesn't need to worry about the details of collecting those values.

In my opinion most of the time an API doesn't actually merit a convenience wrapper, but if it does then it's easy to develop one. Building on a more verbose but ultimately simpler foundation usually makes it much easier to write something that is correct, robust, and reusable. More importantly, the implementation is also easier to modify or even just replace (using polymorphism), since all the stateful dependencies are encapsulated by a dumb sugar layer.

Secondly, when the sugar layer is getting in the way, it can just be ignored. Instead of needing to hack around something, you just need to be a little more verbose.

Lastly, I'd also like to cite the Unix philosophy, another strong influence on Perl: do one thing, and do it well[4]. The anti pattern is creating one thing that provides two features: a shitty convenience layer and a limited solution to the original problem. Dealing with each concern separately helps to focus on doing the important part, and of course doing it well ;-)

This post's subject matter is obviously related to another procedural anti-pattern ($foo->do_work; my $results = $foo->results vs my $results = $foo->do_work). I'll rant about that one in a later post.

[1]

use Foo;
use Foo 0.13;
use Foo qw(foo bar);
use Foo 0.13 qw(foo bar);
use Foo ();
use Foo 0.13 ();

and this doesn't even account for manual invocation of those methods, e.g. from delegating import routines.

[2] This is the wrong kind of laziness; the virtuous laziness is long term

[3] Now we have B::Hooks::EndOfScope

[4] Perl itself does many things, but it is intended to let you write things that do one thing well (originally scripts, though nowadays I would say the CPAN is a much better example)

Saturday, November 21, 2009

Restricted Perl

zby's comments on my last post got me thinking. There are many features in Perl that we no longer use, or that are considered arcane or bad style, or even features we could simply live without. However, if they were removed, lots of code would break. So we keep those features, and we keep writing new code that uses them.

Suppose there was a pragma, similar to no indirect in that it restricts existing language features, and similar to strict in that it lets you opt out of unrelated discouraged behaviors.

I think this would be an interesting baby step towards solving some of the problems that plague Perl code today:

  • Features that are often misused and draw lots of criticism.
  • Language features that are hard to change in the interpreter's implementation, limiting the revisions we can make to Perl 5.
  • Code that will be hard to translate to Perl 6, for no good reason.

On top of that one could implement several different default sets of feature-restricted Perl (sort of like Modern::Perl).

Instead of designing some sort of restricted subset of Perl 5 from the bottom up, several competing subsets could be developed organically, and if memory serves me right that is something we do quite well in our community =)

So anyway, what are some things that you could easily live without in Perl? What things would you be willing to sacrifice if it meant you could trade them off for other advantages? Which features would you rather disallow as part of a coding standard?

My take

Here are some ideas. They are split up into categories which are loosely related, but don't necessarily go hand in hand (some of them even contradict slightly).

They are all of a reasonable complexity to implement, either validating something or removing a language feature in a lexical scope.

It's important to remember that these can be opted out of selectively, when you need them, just like you can say no warnings 'uninitialized' when stringifying undef is something you intentionally allowed.
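Usage might look something like this (the pragma name and the restriction names are entirely hypothetical):

use restricted qw(io_globals formats tie);

{
    # opt back out lexically when you really need the feature
    no restricted 'io_globals';
    local $/; # slurping is fine in here
    ...
}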

Restrictions that would facilitate static modularity

The first four restrictions make it possible to treat .pm files as standalone, cacheable compilation units. The fifth also allows for static linkage (no need to actually invoke import when evaluating a use statement), since the semantics of import are statically known. This could help alleviate startup time problems with Perl code, per compliant compilation unit (without needing to solve the problem as a whole by crippling the ad hoc nature of Perl's compile time everywhere).

  • Disallow recursive require.
  • Disallow modification to a package's symbol table after its package declaration goes out of scope.
  • Restrict a file to only one package (which must match the .pm file name).
  • Disallow modification of packages other than the currently declared one.
  • Restrict the implementation of import to a statically known one.
  • Disallow access to external symbols that are not bound at compile time (e.g. variables from other packages, or subroutines which weren't predeclared; fully qualified names are OK).

Restrictions that allow easier encapsulation of side effects

These restrictions address pollution of state between unrelated bits of code that have interacting dynamic scopes.

  • Disallow modification of any global variables that control IO behavior, such as $/, $|, etc, as well as code that depends on them. IO::Handle would have to be augmented a bit to allow per-handle equivalents, but it's most of the way there (see the slurp sketch after this list).
  • Disallow such variables completely, instead requiring a trusted wrapper for open that sets them at construction time and leaves them immutable thereafter.
  • Disallow /g matches on anything other than private lexicals (sets pos)
  • Disallow $SIG{__WARN__}, $SIG{__DIE__}, and $^S
  • Disallow eval (instead, use trusted code that gets local $@ right)
  • Disallow use of global variables altogether. For instance, instead of $! you'd rely on autodie, for @ARGV handling you'd use MooseX::Getopt or App::Cmd.
  • Disallow mutation through references (only private lexical variables can be modified directly, and complex data structures are therefore immutable after being constructed). This has far reaching implications for object encapsulation, too.
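To illustrate the IO restrictions: today, well behaved code must remember to dynamically scope $/ by hand, as in the classic slurp idiom below ($path stands in for a real file name). These restrictions would make that discipline mandatory instead of merely customary.

my $content = do {
    local $/; # enable slurp mode, restored at the end of the block
    open my $fh, '<', $path or die "open: $!";
    <$fh>;
};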

Restrictions that would encourage immutable data

These restrictions alleviate some of the mutation centric limitations of the SV structure that make lightweight concurrency impossible without protecting every variable access with a mutex. This would also allow aggressive COW.

  • Only allow assignment to a variable at its declaration site. This only applies to lexicals (illustrated after this list).
  • Allow only a single assignment to an SV, by reference or directly (once an SV is given a value it becomes readonly).
  • Disallow assignment to external variables (non lexicals, and closure captures). This is a weaker guarantee than the previous one (which is also much harder to enforce), but with similar implications (no assignment can have side effects that outlive its lexical scope).
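Concretely, under the single assignment restrictions ($x, $y and compute() are just illustrative):

my $x = compute(); # fine: assignment at the declaration site
$x = $x + 1;       # disallowed: mutation after declaration
my $y = $x + 1;    # fine: introduce a new binding instead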

Since many of the string operations in Perl are mutating, purely functional variants should be introduced (most likely as wrappers).

Implicit mutations (such as the upgrading of an SV due to numification) typically result in a copy, so multithreaded access to immutable SVs could either pessimize the caching or just use a spinlock on upgrades.

Restrictions that would facilitate functional programming optimizations

These restrictions would allow representing simplified optrees in more advanced intermediate forms, allowing for interesting optimization transformations.

  • Disallow void context expressions
  • ...except for variable declarations (with the aforementioned single assignment restrictions, this effectively makes every my $x = ... into a let style binding)
  • Allow only a single compound statement per subroutine, apart from let bindings (this statement evaluates to the return value). This special-cases if blocks to be treated as a compound statement, due to the way implicit return values work in Perl (see the example after this list).
  • Disallow opcodes with non local side effects (including calls to non-verified subroutines) for purely functional code.
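For example, under these restrictions a subroutine is reduced to a sequence of let style bindings followed by a single expression (a toy sketch):

sub norm {
    my $x   = $_[0];          # let binding
    my $y   = $_[1];          # let binding
    my $sum = $x**2 + $y**2;  # let binding

    sqrt($sum); # a single expression, which becomes the return value
}

# norm(3, 4) returns 5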

This is perhaps the most limiting set of restrictions. This essentially lets you embed lambda calculus type ASTs natively in Perl. Alternative representations for this subset of Perl could allow lisp style macros and other interesting compile time transformations, without the difficulty of making that alternative AST feature complete for all of Perl's semantics.

Restrictions that facilitate static binding of OO code

Perl's OO is always late bound, but most OO systems can actually be described statically. These restrictions would allow you to opt in for static binding of OO dispatch for a given hierarchy, in specific lexical scopes. This is a little more complicated than just lexical restrictions on features, since metadata about the classes must be recorded as well.

  • Only allow blessing into a class derived from the current package
  • Enforce my Class $var, including static validation of method calls (see the example after this list)
  • Disallow introduction of additional classes at runtime (per class hierarchy or altogether)
  • Based on the previous two restrictions, validate method call sites on typed variable invocants as static subroutine calls (with several target routines, instead of one)
  • Similar to the immutable references restriction above, disallow dereferencing of any blessed reference whose class is not derived from the current package.
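The my Class $var syntax already exists in Perl (the fields pragma uses it for compile time checks); these restrictions would extend it to full static validation of method calls. A toy example:

{
    package Dog;
    sub new  { bless {}, shift }
    sub bark { print "woof\n" }
}

my Dog $spot = Dog->new;

# with a closed class hierarchy this call site can be resolved
# statically, instead of being late bound
$spot->bark;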

Restrictions that are easy to opt in to in most code (opting out only as necessary)

These features are subject to lots of criticism, and their usage tends to be discouraged. They're still useful, but in an ideal world they would probably be implemented as CPAN modules.

  • Disallow formats
  • Disallow $[
  • Disallow tying and usage of tied variables
  • Disallow overloading (declaration of overloads, as well as their use)

A note about implementation

Most of these features can be implemented in terms of opcheck functions, possibly coupled with scrubbing triggered by an end of scope hook. Some of them are static checks at use time. A few others require more drastic measures. For related modules see indirect, Safe, Sys::Protect, and Devel::TypeCheck, to name but a few.

I also see a niche for modules that implement alternatives to built in features, disabling the core feature and providing a better alternative that replaces it instead of coexisting with it. This is the next step in exploratory language evolution as led by Devel::Declare.

The difficulty of modernizing Perl 5's internals is the overwhelming amount of orthogonal concerns whenever you try to implement something. Instead of trying to take care of these problems we could make it possible for the user to promise they won't be an issue. It's not ideal, but it's better than nothing at all.

The distant future

If this sort of opt-out framework turns out to be successful, there's no reason why use 5.20.0 couldn't disable some of the more regrettable features by default, so that you have to explicitly ask for them instead. This effectively makes Perl's cost model pay-per-use, instead of always-pay.

This would also increase the likelihood that people stop using such features in new code, and therefore the decision making aspects of the feature deprecation process would be easier to reason about.

Secondly, and perhaps more importantly, it would be possible to try for alternative implementations of Perl 5 with shorter term deliverables.

Compiling a restricted subset of Perl to other languages (for instance client side JavaScript, different bytecodes, adding JIT support, etc) is a much easier task than implementing the language as a whole. If more feature restricted Perl code would be written and released on the CPAN, investments in such projects would be able to produce useful results sooner, and have clearer indications of progress.

Wednesday, November 18, 2009

Functional programming and unreasonable expectations

<record type="broken">I'm a big fan of purely functional programming</record>.

Another reason I like it so much is that purely functional software tends to be more reliable. Joe Armstrong of Erlang fame makes that point in an excellent talk much better than I could ever hope to.

However, one aspect he doesn't really highlight is that reliability is not only good for keeping your system running, it also makes it easier to program.

When a function is pure it is guaranteed to be isolated from other parts of the program. This separation makes it much easier to change the code in one place without breaking anything unrelated.

Embracing this style of programming has had one huge drawback though: it utterly ruined my expectations of non functional code.

In imperative languages it's all too easy to add unstated assumptions about global state. When violated, these assumptions then manifest in very ugly and surprising ways (typically data corruption).

A good example is reentrancy (or rather the lack thereof) in old style C code. Reentrant code can be freely used in multiple threads, from inside signal handlers, etc. Conversely, non-reentrant routines may only be executed once at a given point in time. Lack of foresight in early C code meant that lots of code had to be converted to be reentrant later on. Since unstated assumptions are by definition hidden this can be a difficult and error prone task.

The specific disappointment that triggered this post is Perl's regular expression engine.

Let's say we're parsing some digits from a string and we want to create a SomeObject with those digits. Easy peasy:

$string =~ m/(\d+)/;
push @results, SomeObject->new( value => $1 );

Encapsulating that match into a reusable regex is a little harder though. Where does the post processing code go? Which capture variable does it use? Isolation would have been nice. The following example might work, but it's totally wrong:

my $match_digits = qr/(\d+)/;

my $other_match = qr{ ... $match_digits ... }x;

$string =~ $other_match;
push @results, SomeObject->new( value => $1 ); # FIXME makes no sense

Fortunately Perl's regex engine has a pretty awesome feature that lets you run code during a match. This is very useful for constructing data from intermediate match results without having to think about nested captures, especially since the $^N variable conveniently contains the result of the last capture.

Not worrying about nested captures is important when you're combining arbitrary patterns into larger ones. There's no reliable way to know where the capture result ends up so it's easiest to process it as soon as it's available.

qr{
    (\d+) # match some digits

    (?{
        # use the previous capture to produce a more useful result
        my $obj = SomeObject->new( value => $^N );

        # local allows backtracking to undo the effects of this block
        # this would have been much simpler if there was a purely
        # functional way to accumulate arbitrary values from regexes
        local @results = ( @results, $obj );
    })
}x;

Even though this is pretty finicky it still goes a long way. With this feature you can create regexes that also encapsulate the necessary post processing, while still remaining reusable.
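For example, building on the code above, the pattern can be embedded in a larger one without knowing anything about its capture groups. Two caveats apply on perls of this vintage: interpolating a pattern that contains code blocks requires use re 'eval', and the localized @results must be copied to a non-localized variable before the match finishes, because the localizations are unwound when the engine returns.

use re 'eval';

our ( @results, @objects ); # package variables, so local() works on them

my $match_object = qr{
    (\d+)
    (?{ local @results = ( @results, SomeObject->new( value => $^N ) ) })
}x;

# the combined pattern doesn't need to know anything about the
# capture groups inside $match_object
my $match_pair = qr{
    ^ $match_object \s* , \s* $match_object $

    # copy out the accumulated objects while the engine is still running
    (?{ @objects = @results })
}x;

"1, 42" =~ $match_pair; # @objects now holds two SomeObjects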

Here is a hypothetical definition of SomeObject:

package SomeObject;
use Moose;

has value => (
    isa => "Int",
    is  => "ro",
);

Constructing SomeObject is a purely functional operation: it has no side effects, and only returns a new object.

The only problem is that the above code is totally broken. It works, but only some of the time. The breakage is pretty random.

Did you spot the bug yet? No? But it's oh so obvious! Look inside Moose::Util::TypeConstraints::OptimizedConstraints and you will find the offending code:

sub Int { defined($_[0]) && !ref($_[0]) && $_[0] =~ /^-?[0-9]+$/ }

The constructor Moose generated for SomeObject is in fact not purely functional at all; though seemingly well behaved, in addition to returning an object it also has the side effect of shitting all over the regexp engine's internal data structures, causing random values to be occasionally assigned to $^N (but only if invoked from inside a (?{ }) block during a match). You can probably imagine what a great time I had finding that bug.

What makes me sad is that the Int validation routine appears purely functional. It takes a value and then without modifying anything merely checks that it's defined, that it's not a reference, and that its stringified form contains only digits, returning a truth value as a result. All of the inputs and all of the outputs are clear, and therefore it seems only logical that this should be freely reusable.

When I came crying to #p5p it turned out that this is actually a known issue. I guess I simply shouldn't have expected the regexp engine to support such things; after all, it has a very long history and these sorts of problems are somewhat typical of C code.

If the regexp engine were reentrant, what I tried to do would have just worked. Reentrancy guarantees one level of arbitrary combinations of code (the bit of reentrant code can be arbitrarily combined with itself). Unfortunately it seems very few people are actually in a position to fix it.

Purely functional code goes one step further. You can reliably mix and match any bit of code with any other bit of code, combining them in new ways, never having to expect failure. The price you have to pay is moving many more parameters around, but this is exactly what is necessary to make the boundaries well defined: all interaction between components is explicit.

When old code gets reused it will inevitably get prodded in ways that the original author did not think of. Functional code has a much better chance of not needing to be reimplemented, because the implementation is kept isolated from the usage context.

In short, every time you write dysfunctional code god kills a code reuse. Please, think of the code reuse!

Tuesday, November 10, 2009

Scoping of the current package

The one thing that I almost always notice when playing around in non Perl languages is how well Perl handles scoping.

There is one place in which Perl got it totally wrong though.

The value of the current package is lexically scoped:

package Foo;

{
    package Bar;
    say __PACKAGE__; # prints Bar
}

say __PACKAGE__; # prints Foo

However, the notion of the current package during compilation is dynamically scoped, even between files:

# Foo.pm:

package Foo;
use Bar;
# Bar.pm:

say __PACKAGE__; # prints Foo

In other words, if you don't declare a package at the top of the .pm file before doing anything, you risk polluting the namespace of the module that called you. What's worse is that it can be unpredictable: only the first module to load Bar will leak into Bar.pm, so this could amount to serious debugging headaches.

Consider the following:

# Foo.pm:

package Foo;
use Moose;

use Bar;

sub foo { ... }

Now suppose a subsequent version of Bar is rewritten using MooseX::Declare:

use MooseX::Declare;

class Bar {
    ...
}

Guess which package the class keyword was exported to?

But maybe Bar was tidy and used namespace::clean; instead of making $foo_object->class suddenly start working, $foo_object->meta would suddenly stop working. And all this without a single change to Foo.pm.

Now imagine what would happen if Foo did require Bar instead of use Bar.

Anyway, I think the point was made: always declare your package upfront, or you risk pooping on your caller. Anything you do before an explicit package declaration is in no man's land.

I'm pretty sure a future version of MooseX::Declare will contain a specific workaround for this, but I still think it's a good habit to always start every file with a package declaration, even if it's made redundant a few lines down.
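For the MooseX::Declare example above, that habit would look like this:

# Bar.pm:

package Bar; # redundant a few lines down, but protects whoever loads us

use MooseX::Declare;

class Bar {
    ...
}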

Monday, November 2, 2009

Sub::Call::Recur

After my last post about Sub::Call::Tail, melo and jrockway both asked me whether I was aware of Clojure's recur form. I wasn't. Shortly afterwards I wrote Sub::Call::Recur, which implements that form in Perl.

The recur operation is a tail call to the current subroutine. It's a bit like Perl's redo builtin, but for functions instead of blocks.

Here is a tail recursive factorial implementation:

sub fact {
    my ( $n, $accum ) = @_;

    $accum ||= 1;

    if ( $n == 0 ) {
        return $accum;
    } else {
        recur( $n - 1, $n * $accum );
    }
}

The difference between this and using Sub::Call::Tail to modify simple recursion is that recur is almost as fast as an iterative loop. The overhead of destroying and recreating the stack frame for the subroutine invocation is avoided.
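For comparison, here is a plain iterative equivalent:

sub fact_iter {
    my $n = shift;

    my $accum = 1;

    while ( $n > 0 ) {
        $accum *= $n;
        $n--;
    }

    return $accum;
}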

I may end up combining the two modules so that a tail call resolving to the current subroutine is automatically optimized like recur, but I'm not sure if that's a good idea yet (the semantics are a little different; Sub::Call::Tail reuses the goto opcode, whereas recur is like a customized reimplementation of the redo opcode).