Wednesday, December 16, 2009

Ironman FAIL

Oops... I moved back to Chamonix over the weekend and completely forgot about blogging.

I guess I'll take a few days to get settled in and then start writing again. I'm aiming for chartreuse with alternating red and monkeyshit highlights and a fishnet, but unfortunately mst has been blogging much more consistently than me so far.

Friday, December 4, 2009

Simplifying BEGIN { } with Moose roles

This is a common Perl pattern:

package MyClass;
use Moose;

use Try::Tiny;

use namespace::autoclean;

BEGIN {
    if ( try { require Foo; 1 } ) {
        *bar = sub {
            my $self = shift;
            Foo::foo($self->baz);
        };
    } else {
        *bar = sub {
            ... # fallback implementation
        };
    }
}

However, since this is a Moose class there is another way:

package MyClass;
use Moose;

use Try::Tiny;

use namespace::autoclean;

with try { require Foo; 1 }
    ? "MyClass::Bar::Foo"
    : "MyClass::Bar::Fallback";
package MyClass::Foo;
use Moose::Role;

use Foo qw(foo);

use namespace::autoclean;

sub bar {
    my $self = shift;

    foo($self->baz);
}

package MyClass::Bar::Fallback;
use Moose::Role;

use namespace::autoclean;

sub bar {
    ...; # fallback implementation
}

Obviously for something that simple it doesn't make sense, but if there is more than one method involved, or the fallback implementation is a little long, it really helps readability in my opinion. Going one step further, you can create an abstract role like this:

package MyClass::Bar::API;
use Moose::Role;

use namespace::autoclean;

requires "bar";

and add it to the class's with statement to validate that all the required methods are really provided by one of the roles.
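
For example, using the same hypothetical role names as above, the composition might look like this; if neither role actually provided bar, composition would fail with a missing method error:

package MyClass;
use Moose;

use Try::Tiny;

use namespace::autoclean;

# composing the abstract API role together with the chosen implementation
# asks Moose to verify that "bar" is really provided by one of them
with "MyClass::Bar::API",
    ( try { require Foo; 1 }
        ? "MyClass::Bar::Foo"
        : "MyClass::Bar::Fallback" );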

Role inclusion is usually thought of as something very static, but dynamism can be very handy, and it doesn't have to hurt the structure of the code.

If you want to be pedantic, the role inclusion is not done at compile time, but the loading of Foo still happens at compile time inside the role (and loading Foo is usually why the code was in a BEGIN block in the first place, in most of the code I've seen).

Thursday, November 26, 2009

The timing of values in imperative APIs

Option configuration is a classic example of when I prefer a purely functional approach. This post is not about broken semantics, but rather about the tension between ease of implementation and ease of use.

Given Perl's imperative heritage, many modules default to imperative option specification. This means that the choice of one behavior over another is represented by an action (setting the option), instead of a value.

Actions are far more complicated than values. For starters, they are part of an ordered sequence. Secondly, it's hard to know what the complete set of choices is, and it's hard to correlate between choices. And of course the actual values must still be moved around.

A simple example is Perl's built in import mechanism.

When you use a module, you are providing a list of arguments that is passed to two optional method calls on the module being loaded, import and VERSION.

Most people know that this:

use Foo;

Is pretty much the same as this:

BEGIN {
    require Foo;
    Foo->import();
}

There's also a secondary syntax, which allows you to specify a version:

use Foo 0.13 qw(foo bar);

The effect is the same as:

BEGIN {
    require Foo;
    Foo->VERSION(0.13);
    Foo->import(qw(foo bar));
}

UNIVERSAL::VERSION is pretty simple: it looks at the version number, compares it with $Foo::VERSION, and then complains loudly if $Foo::VERSION isn't recent enough.
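
A simplified sketch of what UNIVERSAL::VERSION does (ignoring version objects and the other edge cases the real implementation handles):

sub VERSION {
    my ( $class, $want ) = @_;

    no strict 'refs';
    my $have = ${"${class}::VERSION"};

    # complain loudly if the installed version isn't recent enough
    die "$class version $want required--this is only version $have"
        unless defined $have && $have >= $want;

    return $have;
}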

But what if we wanted to do something more interesting, for instance adapt the exported symbols to be compatible with a certain API version?

This is precisely why VERSION is an overridable class method, but this flexibility is still very far from ideal.

my $import_version;

sub VERSION {
    my ( $class, $version ) = @_;

    # first verify that we are recent enough
    $class->SUPER::VERSION($version);

    # stash the value that the user specified
    $import_version = $version;
}

sub import {
    my ( $class, @import ) = @_;

    # get the stashed value
    my $version = $import_version;

    # clear it so it doesn't affect subsequent imports
    undef $import_version;

    ... # use $version and @import to set things up correctly
}

This is a shitty solution because really all we want is a simple value, but we have to juggle it around using a shared variable.

Since the semantics of import would have been made more complex by adding this rather esoteric feature, the API was made imperative instead, to allow things to be optional.

But the above code is not only ugly, it's also broken. Consider this case:

package Evil;
use Foo 0.13 (); # require Foo; Foo->VERSION(0.13);

package Innocent;
use Foo qw(foo bar); # require Foo; Foo->import(qw(foo bar));

In the above code, Evil is causing $import_version to be set, but import is never called. The next invocation of import comes from a completely unrelated consumer, but $import_version never got cleared.

We can't use local to keep $import_version properly scoped (it'd be cleared before import is called). The best solution I can come up with is to key it in a hash by caller(), which at least prevents pollution. This is something every implementation of VERSION that wants to pass the version to import must do to be robust.
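
A sketch of that caller-keyed variant:

my %import_version; # keyed by the consumer's package name

sub VERSION {
    my ( $class, $version ) = @_;

    $class->SUPER::VERSION($version);

    # stash the version for this particular consumer only
    $import_version{ scalar caller } = $version;
}

sub import {
    my ( $class, @import ) = @_;

    # pick up (and clear) the stashed value for this consumer, if any
    my $version = delete $import_version{ scalar caller };

    ... # use $version and @import to set things up correctly
}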

However, even if we isolate consumers from each other, the nonsensical usage use Foo 0.13 () which asks for a versioned API and then proceeds to import nothing, still can't be detected by Foo.

We have 3 * 2 = 6 different code paths[1] for the different variants of use Foo, one of which doesn't even make sense (VERSION but no import), two of which have an explicit stateful dependency between two parts of the code paths (VERSION followed by import, in two variants), and two of which have an implicit stateful dependency (import without VERSION should get undef in $import_version). This sort of combinatorial complexity places the burden of ensuring correctness on the implementors of the API, instead of the designer of the API.

It seems that the original design goal was to minimize the complexity of the most common case (use Foo, no VERSION, and import called with no arguments), but it really makes things difficult for the non default case, somewhat defeating the point of making it extensible in the first place (what good is an extensible API if nobody actually uses it to its full potential).

In such cases my goal is often to avoid fragmenting the data as much as possible. If the version was an argument to import which defaulted to undef people would complain, but that's just because import uses positional arguments. Unfortunately you don't really see this argument passing style in the Perl core:

sub import {
    my ( $class, %args ) = @_;

    if ( exists $args{version} ) {
        ...
    }
    ... $args{import_list};
}

This keeps the values together in both space and time. The closest thing I can recall from core Perl is something like $AUTOLOAD. $AUTOLOAD does not address space fragmentation (an argument is being passed using a variable instead of an argument), but it at least solves the fragmentation in time: the variable is reliably set just before the AUTOLOAD routine is invoked.
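
For instance (a hypothetical proxy class, just to illustrate the timing):

package SomeProxy;

our $AUTOLOAD;

sub AUTOLOAD {
    my $self = shift;

    # $AUTOLOAD was set to the fully qualified method name immediately
    # before this routine was invoked, so the value arrives exactly when
    # it is needed
    my ($method) = ( $AUTOLOAD =~ /([^:]+)$/ );

    return if $method eq "DESTROY";

    $self->{delegate}->$method(@_);
}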

Note that if import worked like this it would still be far from pure (it mutates the symbol table of its caller), but the actual computation of the symbols to export can and should be side effect free, and if the version were specified in this way that would have been easier.

This is related to the distinction between intention and algorithm. Think of it this way: when you say use Foo 0.13 qw(foo bar), do you intend to import a specific version of the API, or do you intend to call a method to set the version of the API and then call a method to import the API? The declarative syntax has a close affinity to the intent. On the other hand, looking at it from the perspective of Foo, where the intent is to export a specific version of the API, the code structure does not reflect that at all.

Ovid wrote about a similar issue with Test::Builder, where a procedural approach was taken (diagnosis output is treated as "extra" stuff, not really a part of a test case's data).

Moose also suffers from this issue in its sugar layer. When a Moose class is declared the class definition is modified step by step, causing load time performance issues, order sensitivity (often you need to include a role after declaring an attribute for required method validation), etc.

Lastly, PSGI's raison d'etre is that the CGI interface is based on stateful values (%ENV, global filehandles). The gist of the PSGI spec is encapsulating those values into explicit arguments, without needing to imperatively monkeypatch global state.

I think the reason we tend to default to imperative configuration is out of a short sighted laziness[2]. It seems like it's easier to be imperative, when you are thinking about usage. For instance, creating a data type to encapsulate arguments is tedious. Dealing with optional vs. required arguments manually is even more so. Simply forcing the user to specify everything is not very Perlish. This is where the tension lies.

The best compromise I've found is a multilayered approach. At the foundation I provide a low level, explicit API where all of the options are required all at once, and cannot be changed afterwards. This keeps the combinatorial complexity down and lets me do more complicated validation of dependent options. On top of that I can easily build a convenience layer which accumulates options from an imperative API and then provides them to the low level API all at once.
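
A sketch of what that layering can look like (all the names here are hypothetical):

package Thing;
use Moose;

# the low level API: everything is provided at construction time and is
# immutable afterwards, so dependent options can be validated in one place
has version     => ( is => "ro", isa => "Maybe[Num]" );
has import_list => ( is => "ro", isa => "ArrayRef[Str]", required => 1 );

package Thing::Builder;
use Moose;

# the convenience layer accumulates options imperatively...
has _options => ( is => "ro", isa => "HashRef", default => sub { {} } );

sub set_version { $_[0]->_options->{version} = $_[1] }

sub add_imports { push @{ $_[0]->_options->{import_list} ||= [] }, @_[ 1 .. $#_ ] }

# ...and hands them to the low level API all at once
sub create { Thing->new( %{ $_[0]->_options } ) }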

This was not done in Moose because at the time we did not know how to detect the end of a .pm file, so we couldn't know when the declaration was finished[3].

Going back to VERSION and import, this approach would involve capturing the values as best we can in a thin import (the sugar layer), and passing them onwards together to some underlying implementation that doesn't need to worry about the details of collecting those values.

In my opinion most of the time an API doesn't actually merit a convenience wrapper, but if it does then it's easy to develop one. Building on a more verbose but ultimately simpler foundation usually makes it much easier to write something that is correct, robust, and reusable. More importantly, the implementation is also easier to modify or even just replace (using polymorphism), since all the stateful dependencies are encapsulated by a dumb sugar layer.

Secondly, when the sugar layer is getting in the way, it can just be ignored. Instead of needing to hack around something, you just need to be a little more verbose.

Lastly, I'd also like to cite the Unix philosophy, another strong influence on Perl: do one thing, and do it well[4]. The anti pattern is creating one thing that provides two features: a shitty convenience layer and a limited solution to the original problem. Dealing with each concern separately helps to focus on doing the important part, and of course doing it well ;-)

This post's subject matter is obviously related to another procedural anti-pattern ($foo->do_work; my $results = $foo->results vs my $results = $foo->do_work). I'll rant about that one in a later post.

[1]

use Foo;
use Foo 0.13;
use Foo qw(foo bar);
use Foo 0.13 qw(foo bar);
use Foo ();
use Foo 0.13 ();

and this doesn't even account for manual invocation of those methods, e.g. from delegating import routines.

[2] This is the wrong kind of laziness, the virtuous laziness is long term

[3] Now we have B::Hooks::EndOfScope

[4] Perl itself does many things, but it is intended to let you write things that do one thing well (originally scripts, though nowadays I would say the CPAN is a much better example)

Saturday, November 21, 2009

Restricted Perl

zby's comments on my last post got me thinking. There are many features in Perl that we no longer use, or that are considered arcane or bad style, or even features we could simply live without. However, if they were removed, lots of code would break. So we keep those features, and we keep writing new code that uses them.

Suppose there was a pragma, similar to no indirect in that it restricts existing language features, and similar to strict in that it lets you opt out of unrelated discouraged behaviors.

I think this would be an interesting baby step towards solving some of the problems that plague Perl code today:

  • Features that are often misused and need lots of critique.
  • Language features that are hard to change in the interpreter's implementation, limiting the revisions we can make to Perl 5.
  • Code that will be hard to translate to Perl 6, for no good reason.

On top of that one could implement several different default sets of feature-restricted Perl (sort of like Modern::Perl).
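
Usage could look something like this (the pragma and profile names are entirely made up):

use Restricted::Perl -profile => "static_modules";

# and, like no indirect or no warnings, individual restrictions could be
# lifted in a lexical scope when you really need the feature:
{
    no Restricted::Perl "one_package_per_file";
    ...;
}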

Instead of designing some sort of restricted subset of Perl 5 from the bottom up, several competing subsets could be developed organically, and if memory serves me right that is something we do quite well in our community =)

So anyway, what are some things that you could easily live without in Perl? What things would you be willing to sacrifice if it meant you could trade them off for other advantages? Which features would you rather disallow as part of a coding standard?

My take

Here are some ideas. They are split up into categories which are loosely related, but don't necessarily go hand in hand (some of them even contradict slightly).

They are all of a reasonable complexity to implement, either validating something or removing a language feature in a lexical scope.

It's important to remember that these can be opted out of selectively, when you need them, just like you can say no warnings 'uninitialized' when stringifying undef is something you intentionally allowed.

Restrictions that would facilitate static modularity

The first four restrictions make it possible to treat .pm files as standalone, cacheable compilation units. The fifth also allows for static linkage (no need to actually invoke import when evaluating a use statement), since the semantics of import are statically known. This could help alleviate startup time problems with Perl code, per compliant compilation unit (without needing to solve the problem as a whole by crippling the ad hoc nature of Perl's compile time everywhere).

  • Disallow recursive require.
  • Disallow modification to a package's symbol table after its package declaration goes out of scope.
  • Restrict a file to only one package (which must match the .pm file name).
  • Disallow modification of packages other than the currently declared one.
  • Restrict the implementation of import to a statically known one.
  • Disallow access to external symbols that are not bound at compile time (e.g. variables from other packages, or subroutines which weren't predeclared; fully qualified names are OK).

Restrictions that allow easier encapsulation of side effects

These restrictions address pollution of state between unrelated bits of code that have interacting dynamic scopes.

  • Disallow modification of any global variables that control IO behavior, such as $/, $|, etc, as well as code that depends on them. IO::Handle would have to be augmented a bit to allow per handle equivalents, but it's most of the way there.
  • Disallow such variables completely, instead requiring a trusted wrapper for open that sets them at construction time and leaves them immutable thereafter.
  • Disallow /g matches on anything other than private lexicals (sets pos)
  • Disallow $SIG{__WARN__}, $SIG{__DIE__}, and $^S
  • Disallow eval (instead, use trusted code that gets local $@ right)
  • Disallow use of global variables altogether. For instance, instead of $! you'd rely on autodie, for @ARGV handling you'd use MooseX::Getopt or App::Cmd.
  • Disallow mutation through references (only private lexical variables can be modified directly, and complex data structures are therefore immutable after being constructed). This has far reaching implications for object encapsulation, too.

Restrictions that would encourage immutable data.

These restrictions alleviate some of the mutation centric limitations of the SV structure, that make lightweight concurrency impossible without protecting every variable access with a mutex. This would also allow aggressive COW.

  • Only allow assignment to a variable at its declaration site. This only applies to lexicals.
  • Allow only a single assignment to an SV (by reference or directly; once an SV is given a value it becomes readonly).
  • Disallow assignment to or modification of external variables (non lexicals, and closure captures). This is a weaker guarantee than the previous one (which is also much harder to enforce), but with similar implications (all assignment is guaranteed not to have side effects that outlive its lexical scope)

Since many of the string operations in Perl are mutating, purely functional variants should be introduced (most likely as wrappers).

Implicit mutations (such as the upgrading of an SV due to numification) typically result in a copy, so multithreaded access to immutable SVs could either pessimize the caching or just use a spinlock on upgrades.

Restrictions that would facilitate functional programming optimizations

These restrictions would allow representing simplified optrees in more advanced intermediate forms, allowing for interesting optimization transformations.

  • Disallow void context expressions
  • ...except for variable declarations (with the aforementioned single assignment restrictions, this effectively makes every my $x = ... into a let style binding)
  • Allow only a single compound statement per subroutine, apart from let bindings (that evaluates to the return value). This special cases if blocks to be treated as a compound statement due to the way implicit return values work in Perl.
  • Disallow opcodes with non local side effects (including calls to non-verified subroutines) for purely functional code.

This is perhaps the most limiting set of restrictions. This essentially lets you embed lambda calculus type ASTs natively in Perl. Alternative representations for this subset of Perl could allow lisp style macros and other interesting compile time transformations, without the difficulty of making that alternative AST feature complete for all of Perl's semantics.

Restrictions that facilitate static binding of OO code

Perl's OO is always late bound, but most OO systems can actually be described statically. These restrictions would allow you to opt in for static binding of OO dispatch for a given hierarchy, in specific lexical scopes. This is a little more complicated than just lexical restrictions on features, since metadata about the classes must be recorded as well.

  • Only allow blessing into a class derived from the current package
  • Enforce my Class $var, including static validation of method calls
  • Disallow introduction of additional classes at runtime (per class hierarchy or altogether)
  • Based on the previous two restrictions, validate method call sites on typed variable invocants as static subroutine calls (with several target routines, instead of one)
  • Similar to the immutable references restriction above, disallow dereferencing of any blessed reference whose class is not derived from the current package.

Restrictions that are easy to opt in to in most code (opting out only as necessary)

These features are subject to lots of criticism, and their usage tends to be discouraged. They're still useful, but in an ideal world they would probably be implemented as CPAN modules.

  • Disallow formats
  • Disallow $[
  • Disallow tying and usage of tied variables
  • Disallow overloading (declaration of overloads, as well as their use)

A note about implementation

Most of these features can be implemented in terms of opcheck functions, possibly coupled with scrubbing triggered by an end of scope hook. Some of them are static checks at use time. A few others require more drastic measures. For related modules see indirect, Safe, Sys::Protect, and Devel::TypeCheck, to name but a few.

I also see a niche for modules that implement alternatives to built in features, disabling the core feature and providing a better alternative that replaces it instead of coexisting with it. This is the next step in exploratory language evolution as led by Devel::Declare.

The difficulty of modernizing Perl 5's internals is the overwhelming amount of orthogonal concerns whenever you try to implement something. Instead of trying to take care of these problems we could make it possible for the user to promise they won't be an issue. It's not ideal, but it's better than nothing at all.

The distant future

If this sort of opt-out framework turns out to be successful, there's no reason why use 5.20.0 couldn't disable some of the more regrettable features by default, so that you have to explicitly ask for them instead. This effectively makes Perl's cost model pay-per-use, instead of always-pay.

This would also increase the likelihood that people stop using such features in new code, and therefore the decision making aspects of the feature deprecation process would be easier to reason about.

Secondly, and perhaps more importantly, it would be possible to try for alternative implementations of Perl 5 with shorter term deliverables.

Compiling a restricted subset of Perl to other languages (for instance client side JavaScript, different bytecodes, adding JIT support, etc) is a much easier task than implementing the language as a whole. If more feature restricted Perl code would be written and released on the CPAN, investments in such projects would be able to produce useful results sooner, and have clearer indications of progress.

Wednesday, November 18, 2009

Functional programming and unreasonable expectations

<record type="broken">I'm a big fan of purely functional programming</record>.

Another reason I like it so much is that purely functional software tends to be more reliable. Joe Armstrong of Erlang fame makes that point in an excellent talk much better than I could ever hope to.

However, one aspect he doesn't really highlight is that reliability is not only good for keeping your system running, it also makes it easier to program.

When a function is pure it is guaranteed to be isolated from other parts of the program. This separation makes it much easier to change the code in one place without breaking anything unrelated.

Embracing this style of programming has had one huge drawback though: it utterly ruined my expectations of non functional code.

In imperative languages it's all too easy to add unstated assumptions about global state. When violated, these assumptions then manifest in very ugly and surprising ways (typically data corruption).

A good example is reentrancy (or rather the lack thereof) in old style C code. Reentrant code can be freely used in multiple threads, from inside signal handlers, etc. Conversely, non-reentrant routines may only be executed once at a given point in time. Lack of foresight in early C code meant that lots of code had to be converted to be reentrant later on. Since unstated assumptions are by definition hidden this can be a difficult and error prone task.

The specific disappointment that triggered this post is Perl's regular expression engine.

Let's say we're parsing some digits from a string and we want to create a SomeObject with those digits. Easy peasy:

$string =~ m/(\d+)/;
push @results, SomeObject->new( value => $1 );

Encapsulating that match into a reusable regex is a little harder though. Where does the post processing code go? Which capture variable does it use? Isolation would have been nice. The following example might work, but it's totally wrong:

my $match_digits = qr/(\d+)/;

my $other_match = qr{ ... $match_digits ... }x;

$string =~ $other_match;
push @results, SomeObject->new( value => $1 ); # FIXME makes no sense

Fortunately Perl's regex engine has a pretty awesome feature that lets you run code during a match. This is very useful for constructing data from intermediate match results without having to think about nested captures, especially since the $^N variable conveniently contains the result of the last capture.

Not worrying about nested captures is important when you're combining arbitrary patterns into larger ones. There's no reliable way to know where the capture result ends up so it's easiest to process it as soon as it's available.

qr{
    (\d+) # match some digits

    (?{
        # use the previous capture to produce a more useful result
        my $obj = SomeObject->new( value => $^N );

        # local allows backtracking to undo the effects of this block
        # this would have been much simpler if there was a purely
        # functional way to accumulate arbitrary values from regexes
        local @results = ( @results, $obj );
    })
}x;

Even though this is pretty finicky it still goes a long way. With this feature you can create regexes that also encapsulate the necessary post processing, while still remaining reusable.

Here is a hypothetical definition of SomeObject:

package SomeObject;
use Moose;

has value => (
    isa => "Int",
    is  => "ro",
);

Constructing SomeObject is a purely functional operation: it has no side effects, and only returns a new object.

The only problem is that the above code is totally broken. It works, but only some of the time. The breakage is pretty random.

Did you spot the bug yet? No? But it's oh so obvious! Look inside Moose::Util::TypeConstraints::OptimizedConstraints and you will find the offending code:

sub Int { defined($_[0]) && !ref($_[0]) && $_[0] =~ /^-?[0-9]+$/ }

The constructor Moose generated for SomeObject is in fact not purely functional at all; though seemingly well behaved, in addition to returning an object it also has the side effect of shitting all over the regexp engine's internal data structures, causing random values to be occasionally assigned to $^N (but only if invoked from inside a (?{ }) block during a match). You can probably imagine what a great time I had finding that bug.

What makes me sad is that the Int validation routine appears purely functional. It takes a value and then without modifying anything merely checks that it's defined, that it's not a reference, and that its stringified form contains only digits, returning a truth value as a result. All of the inputs and all of the outputs are clear, and therefore it seems only logical that this should be freely reusable.

When I came crying to #p5p it turned out that this is actually a known issue. I guess I simply shouldn't have expected the regexp engine to do such things, after all it has a very long history and these sorts of problems are somewhat typical of C code.

If the regexp engine were reentrant then what I tried to do would have just worked. Reentrancy guarantees one level of arbitrary combinations of code (the bit of reentrant code can be arbitrarily combined with itself). Unfortunately it seems very few people are actually in a position to fix it.

Purely functional code goes one step further. You can reliably mix and match any bit of code with any other bit of code, combining them in new ways, never having to expect failure. The price you have to pay is moving many more parameters around, but this is exactly what is necessary to make the boundaries well defined: all interaction between components is explicit.

When old code gets reused it will inevitably get prodded in ways that the original author did not think of. Functional code has a much better chance of not needing to be reimplemented, because the implementation is kept isolated from the usage context.

In short, every time you write dysfunctional code god kills a code reuse. Please, think of the code reuse!

Tuesday, November 10, 2009

Scoping of the current package

The one thing that I almost always notice when playing around in non Perl languages is how well Perl handles scoping.

There is one place in which Perl got it totally wrong though.

The value of the current package is lexically scoped:

package Foo;

{
    package Bar;
    say __PACKAGE__; # prints Bar
}

say __PACKAGE__; # prints Foo

However, the notion of the current package during compilation is dynamically scoped, even between files:

# Foo.pm:

package Foo;
use Bar;
# Bar.pm:

say __PACKAGE__; # prints Foo

In other words, if you don't declare a package at the top of the .pm file before doing anything, you are risking polluting the namespace of the module that called you. What's worse is that it can be unpredictable: only the first module to load Bar will leak into Bar.pm, so this could amount to serious debugging headaches.

Consider the following:

# Foo.pm:

package Foo;
use Moose;

use Bar;

sub foo { ... }

Now suppose a subsequent version of Bar is rewritten using MooseX::Declare:

use MooseX::Declare;

class Bar {
    ...
}

Guess which package the class keyword was exported to?

But maybe Bar was tidy and used namespace::clean; instead of making $foo_object->class suddenly start working, $foo_object->meta would suddenly stop working. And all this without a single change to Foo.pm.

Now imagine what would happen if Foo did require Bar instead of use.

Anyway, I think the point has been made: always declare your package upfront or you risk pooping on your caller. Anything you do before an explicit package declaration is in no man's land.

I'm pretty sure a future version of MooseX::Declare will contain a specific workaround for this, but I still think it's a good habit to always start every file with a package declaration, even if it's made redundant a few lines down.
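
In other words, something like this:

# Bar.pm

package Bar; # declared upfront, so nothing leaks into whoever loaded us

use MooseX::Declare;

class Bar { # redeclares the package a few lines down, which is fine
    ...
}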

Monday, November 2, 2009

Sub::Call::Recur

After my last post about Sub::Call::Tail, melo and jrockway both asked me whether I was aware of Clojure's recur form. I wasn't. Shortly afterwards I wrote Sub::Call::Recur, which implements that form in Perl.

The recur operation is a tail call to the current subroutine. It's a bit like Perl's redo builtin, but for functions instead of blocks.

Here is a tail recursive factorial implementation:

sub fact {
    my ( $n, $accum ) = @_;

    $accum ||= 1;

    if ( $n == 0 ) {
        return $accum;
    } else {
        recur( $n - 1, $n * $accum );
    }
}

The difference between this and using Sub::Call::Tail to modify simple recursion is that recur is almost as fast as an iterative loop. The overhead of destroying and recreating the stack frame for the subroutine invocation is avoided.
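
For comparison, the same function written with a general tail call via Sub::Call::Tail looks almost identical, but still tears down and recreates a stack frame on every iteration:

use Sub::Call::Tail;

sub fact {
    my ( $n, $accum ) = @_;

    $accum ||= 1;

    return $accum if $n == 0;

    # the callee is resolved at runtime and the current frame is
    # replaced, much like goto &fact
    tail fact( $n - 1, $n * $accum );
}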

I may end up combining the two modules so that a tail call resolving to the current subroutine is automatically optimized like recur, but I'm not sure if that's a good idea yet (the semantics are a little different; Sub::Call::Tail reuses the goto opcode, whereas recur is like a customized reimplementation of the redo opcode).

Saturday, October 31, 2009

Sub::Call::Tail

I've just released Sub::Call::Tail which allows for a much more natural tail call syntax than Perl's goto built in.

It provides a tail keyword that modifies normal invocations to behave like goto &sub, without needing the ugly @_ manipulation.

Instead of this horrible kludge:

@_ = ( $foo, $bar );
goto &foo;

You can now write:

tail foo($foo, $bar);

And much more importantly this method call emulation atrocity:

@_ = ( $object, $foo, $bar );
goto $object->can("foo");

Can now be written as:

tail $object->foo($foo, $bar);

Finally we can write infinitely tail recursive and CPS code in constant stack space, without the syntactic letdown that is goto. Lambdacamels rejoice!

Thanks so much to Zefram for his numerous tests and contributions.

Wednesday, October 28, 2009

Versioned site_lib

Today I wanted to install a simple module on a production machine. I used the CPAN utility, as usual. Unfortunately that also pulled in an upgraded dependency which was not backwards compatible, breaking the application.

I hate yak shaving.

But not nearly as much as I hate surprise yak shaving.

I want to fix compatibility problems in my development environment on my own time, not hastily on a live server.

I wrote a small module to address this problem. To set it up run git site-perl-init. This will initialize the .git directory and configure CPAN to wrap make install and ./Build install with a helper script.

The wrapper will invoke the installation command normally, and then commit any changes to installsitelib with the distribution name as the commit message. This will happen automatically every time CPAN tells a module to install itself.
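
Conceptually the wrapper does something like this (a hypothetical sketch; the real script differs in details such as how it learns the distribution name):

#!/usr/bin/perl
use strict;
use warnings;
use Config;

# hypothetical: the distribution name is passed as the first argument
my $dist = shift @ARGV;

# run the real "make install" / "./Build install"
system(@ARGV) == 0 or exit( $? >> 8 );

# then snapshot installsitelib, using the distribution name as the message
chdir $Config{installsitelib} or die $!;
system qw(git add -A);
system qw(git commit -q -m), $dist;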

The approach is very simplistic; it does not version manpages or the bin directory, nor does it work with local::lib or CPANPLUS (at least not yet).

It is just enough to let me run git reset --hard "master@{1 hour ago}" to instantly go back to a working setup.

Friday, October 23, 2009

Authenticated Encryption

One thing that makes me cringe is when people randomly invent their own cryptographic protocols. There's a Google Tech Talk by Nate Lawson where he explains some surprising approaches to attacking a cryptographic algorithm. It illustrates why rolling your own is probably a bad idea ;-)

Perhaps the most NIH cryptographic protocol I've seen is digitally signing as well as encrypting a message, in order to store tamper resistant data without revealing its contents. This is often done for storing sensitive data in cookies.

Obviously such a protocol can be built using HMACs and ciphers, but high level tools are already available, ones that have already been designed and analyzed by people who actually know what they're doing: authenticated encryption modes of operation.

WTF is a cipher mode?

Block ciphers are sort of like hash functions: they take a block of data and scramble it.

Simply encrypting your data blocks one by one is not a good way of securing it though. Wikipedia has a striking example (the image of Tux encrypted with ECB): even though every pixel is encrypted, the data as a whole still reveals a lot.

Suffice it to say that modes of operation are a wrapper that takes a low level scrambling function (the block cipher) and provides a less error prone tool, one that is more difficult to misuse.

On the CPAN

Crypt::CBC and Crypt::Ctr are implementations of some of the more classic cipher modes. But this post is ranting about people not using authenticated modes.

Crypt::GCM and Crypt::EAX implement two different AEAD modes of operation using any block cipher.

These are carefully designed and analyzed algorithms, and the CPAN implementations make use of the tests from the articles describing the algorithms, so it sure beats rolling your own.

Secondly, Crypt::Util provides a convenience layer that builds on these tools (and many others), so perhaps Crypt::Util already handles what you want.

To tamper protect a simple data structure you can do something like this:

my $cu = Crypt::Util->new( key => ... );

my $ciphertext = $cu->tamper_proof_data( data => { ... }, encrypt => 1 );

Crypt::Util will use Storable to encode the data into a string, and then use an authenticated encryption mode to produce the ciphertext.

To decrypt, simply do:

my $data = $cu->thaw_tamper_proof( string => $ciphertext );

Crypt::Util will decrypt and validate the ciphertext, and only after it's sure that the data is trusted it'll start unpacking the data, and if appropriate using Storable to deserialize the message. All allocations based on untrusted data are limited to 64KiB.

Don't sue me

I'm not saying that the CPAN code is guaranteed to be safe. I'm saying this is a better idea than rolling your own. If your application is sensitive you have no excuse not to open up the code and audit it.

Friday, October 16, 2009

Event driven PSGI

I spent most of today and yesterday bikeshedding event driven PSGI with miyagawa on #http-engine.

We seem to have converged on something that is both fairly portable to different event driven implementations, without being too yucky for blocking backends.

For example, if you don't yet know the response code or headers and are waiting on some other event driven thing, it's sort of like continuation passing style:

$app = sub {
    my $env = shift;

    ...

    return sub {
        my $write = shift;

        $some_event_thingy->do_your_thing( when_finished => sub {
            $write->([ 200, $headers, $body ]);
        });
    };
};

A more complex example involves streaming:

$app = sub {
    my $env = shift;

    ...

    return sub {
        my $write = shift;

        my $out = $write->([ 200, $headers ]);

        $some_event_thingy->new_data(sub {
            my $data = shift;

            if ( defined $data ) {
                $out->write($data);
            } else {
                $out->close;
            }
        });
    };
};

Lastly, if you are worried about too much memory usage in the output buffer, you can provide a callback to poll_cb:

$app = sub {
    my $env = shift;

    ...

    return sub {
        my $write = shift;

        $write->([ 200, $headers ])->poll_cb(sub {
            my $out = shift;

            $out->write($some_more);

            $out->close() if $finished;
        });
    };
};

But poll_cb should only be used on event driven backends (check for it using $out->can("poll_cb")).

This lets simple streaming applications work nicely under blocking backends as well as event driven ones.

Even better, while I was busy implementing this for the AnyEvent backend, frodwith whipped up a POE implementation in no time at all.

This pretty much obsoletes my IO::Writer sketch. The only case it doesn't cover but which IO::Writer theoretically does is poll_cb based nonblocking output, combined with a non blocking data source, but without an event driven environment. This sucks because nonblocking IO without an event loop wastes a lot of CPU. I can't imagine why anyone would actually try that, so I hereby declare IO::Writer deprecated, thankfully before I actually wrote a robust implementation ;-)

Thursday, October 8, 2009

Roles and Delegates and Refactoring

Ovid writes about the distinction between responsibility and behavior, and what that means in the context of roles.

He argues that the responsibilities of a class may sometimes lie in tangent with additional behaviors it performs (and that these behaviors are often also in tangent with one another).

Since roles lend themselves to more horizontal code reuse (what multiple inheritance tries to allow but fails to do safely), he makes the case that they are more appropriate for loosely related behaviors.

I agree. However, roles only facilitate the detection of a flawed taxonomy, which under multiple inheritance seems to work. They can often validate a sensible design, but they don't provide a solution for a flawed one.

If you take a working multiple inheritance based design and change every base class into a role, it will still work. Roles will produce errors for ambiguities, but if the design makes sense there shouldn't be many of those to begin with. The fundamental structure of the code hasn't actually changed with the migration to roles.

Roles do not in their own right prevent god objects from forming. Unfortunately that has not yet been automated ;-)

Another Tool

Wikipedia defines Delegation as:

a technique where an object outwardly expresses certain behaviour but in reality delegates responsibility for implementing that behavior to an associated object

Instead of merging the behavior into the consuming class (using roles or inheritance), the class uses a helper object to implement that behavior, and doesn't worry about the details.

Roles help you find out you have a problem, but delegates help you to fix it.

Delegation by Example

A simple but practical example of how to refactor a class that mixes two behaviors is Test::Builder:

  • It provides an API to easily generate TAP output
  • It provides a way to share a TAP generator between the various Test:: modules on the CPAN, using the singleton pattern.

Test::Builder's documentation says:

Since you only run one test per program new always returns the same Test::Builder object.

The problem is that the assumption that you will only generate one stream of TAP per program hasn't got much to do with the problem of generating valid TAP data.

That assumption makes it simpler to generate TAP output from a variety of loosely related modules designed to be run with Test::Harness, but it is limiting if you want to generate TAP in some other scenario.[1]

With a delegate based design the task of obtaining the appropriate TAP generation helper and the task of generating TAP output would be managed by two separate objects, where the TAP generator is oblivious to the way it is being used.

In this model Test::Builder is just the singletony bits, and it uses a TAP generation helper. It would still have the same API as it does now, but a hypothetical TAP::Generator object would generate the actual TAP stream.

The core idea is to separate the behaviors and responsibilities even more, not just into roles, but into different objects altogether.

Though this does make taxonomical inquiries like isa and does a little more roundabout, it allows a lot more flexibility when weaving together a complex system from simple parts, and encourages reuse and refactoring by making polymorphism and duck typing easy.

If you want to use TAP for something other than testing Perl modules, you could do this without hacking around the singleton crap.

Delegating with Moose

Moose has strong support for delegation. I love this, because it means that convincing people to use delegation is much easier than it was before, since it's no longer tedious and doesn't need to involve AUTOLOAD.

To specify a delegation, you declare an attribute and use the handles option:

has tap_generator => (
    isa => "TAP::Generator",
    is  => "ro",
    handles => [qw(plan ok done_testing ...)],
);

Roles play a key part in making delegation even easier to use. This is because roles dramatically decrease the burden of maintenance and refactoring, for all the reasons that Ovid often cites.

When refactoring role based code to use delegation, you can simply replace your use of the role with an attribute:

has tap_generator => (
    does => "TAP::Generator",
    is   => "ro",
    handles => "TAP::Generator",
);

This will automatically proxy all of the methods of the TAP::Generator role to the tap_generator attribute.[2]

Moose's handles parameter to attributes has many more features which are covered in the Delegation section of the manual.

A Metaclass Approach to ORMs

Ovid's example for roles implementing a separate behavior involves a simple ORM. It involves a Server class, which in order to behave appropriately needs some of its attributes stored persistently (there isn't much value in a server management system that can't store information permanently).

He proposes the following:

class Server does SomeORM {
   has IPAddress $.ip_address is persisted;
   has Str       $.name       is persisted;

   method restart(Bool $nice=True) {
       say $nice ?? 'yes' !! 'no';
   }
}

But I think this confuses the notion of a class level behavior with a metaclass level behavior.

The annotation is persisted is on the same level as the annotation IPAddress or Str, it is something belonging to the meta attribute.

Metaclasses as Delegates

A metaclass is an object that represents a class. In a sense it could be considered a delegate of the compiler or language runtime. In the case of Moose this is a bit of a stretch (since the metaclass is not exactly authoritative as far as Perl is concerned, the symbol table is).

Conceptually it still holds though. The metaclass is responsible for reflecting as well as specifying the definition of a single class. The clear separation of that single responsibility is the key here. The metaclass is delegated to by the sugar layer that uses it, and indirectly by the runtime that invokes methods on the class (since the metaclass is in control of the symbol table).

Furthermore, the metaclass itself delegates many of its behaviors. Accessor generation is the responsibility of the attribute meta object.

To frame the ORM example in these terms, we have several components:

  • persisted attributes, modeled by the meta attribute delegates of the metaclass object, with an additional role for persistence[3]
  • the metaclass, which must also be modified for the additional persistence functionality (to make use of the attributes' extended interface)
  • an object construction helper that knows about the class it is constructing, as well as the database handle from which to get the data, but doesn't care about the actual problem domain.[4]
  • an object that models information about the problem domain being addressed (Server)

By separating the responsibilities of the business logic from database connectivity from class definition we get decoupled components that can be reused more easily, and which are less sensitive to changes in one another.

KiokuDB

Lastly, I'd like to mention that KiokuDB can be used to solve that Server problem far more simply. I promise I'm not saying that only on account of my vanity ;-)

The reason it's a simpler solution is that the Server class does not need to know that it is being persisted at all, and therefore does not need to accommodate a persistence layer. The KiokuDB handle would be asked to persist that object, and proceed to take it apart using reflection provided by the metaclass:

my $server = $dir->lookup($server_id);
$server->name("Pluto");
$dir->update($server);

This keeps the persistence behavior completely detached from the responsibilities of the server, which is to model a physical machine.

The problem of figuring out how to store fields in a database can be delegated to a completely independent part of the program, which operates on Server via its metaclass, instead of being a tool that Server uses. The behavior or responsibility (depending on how you look at it) of storing data about servers in a database can be completely removed from the Server class, which is concerned solely with the shape of that data.

Summary

There is no silver bullet.

Roles are almost always better than multiple inheritance, but don't replace some of the uses of single inheritance.

Delegates provide even more structure than roles, and are usually best implemented using roles.

By leveraging both techniques at the class as well as the metaclass level you can often achieve dramatically simplified results.

Roles may help with code reuse, but the classes they create are still static (even runtime generated classes are still classes with a symbol table). Delegation allows components to be swapped and combined much more easily. When things get more complicated inversion of control goes even further, and the end result is usually both more flexible and simpler than only static role composition.

Secondly, and perhaps more importantly, delegates are not limited to single use[5]. You can have a list of delegates performing a responsibility together. Sartak's API Design talk at YAPC::Asia explained how Dist::Zilla uses a powerful combination of roles and plugin delegates, taking this even further.

At the bottom line, though, nothing can replace a well thought out design. Reducing your problem space is often the best way of finding a clean solution. What I like so much about delegates is that they encourage you to think about the real purpose of each and every component in the system.

Even the simple need for coming up with a name for each component can help you reach new understandings about the nature of the problem.

Delegation heavy code tends to force you to come up with many names because there are many small classes, but this shouldn't lead to Java hell. Roles can really help alleviate this (they sure beat interfaces), but even so, this is just code smell that points to an overly complex solution. If it feels too big it probably is. A bloated solution that hasn't been factored out to smaller parts is still a bloated solution.

Once you've taken the problem apart you can often figure out which parts are actually necessary. Allowing (and relying on) polymorphism should make things future proof without needing to implement everything up front. Just swap your simple delegate with a more complicated one when you need to.

[1] Fortunately Test::Builder provides an alternate constructor, create, that is precisely intended for this case.

[2] Note that this is currently flawed in Moose for several reasons: accessors are not delegated automatically, the ->does method on the delegator will not return true for roles of the delegate (specifically "RoleName" in this case, etc), but that's generally not a problem in practice (there are failing tests and no one has bothered to fix them yet).

[3] For clarity's sake we tend to call roles applied to meta objects traits, so in this case the Server class would be using the SomeORM and persistent class and attribute traits.

[4] in DBIC these responsibilities are actually carried out by the resultset, the result source, and the schema objects, a rich hierarchy of delegates in its own right.

[5] parameterized roles are a very powerful static abstraction that allows multiple compositions of a single role into a single consumer with different parameters.

Monday, October 5, 2009

Are Filehandles Objects?

Perl has a very confusing set of behaviors for treating filehandles as objects.

ADHD Summary

Globs which contain open handles can be treated as objects, even though they aren't blessed.

Always load IO::Handle and FileHandle, to allow the method syntax.

Whenever you are using filehandles, use the method syntax, regardless of whether it's a real handle or a fake one. A fake handle that works with the builtins needs to jump through some nasty hoops.

Whenever you are creating fake handles, the aforementioned hoops are that you should tie *{$self} (or return a tied glob from the *{} overload), so that the builtins will know to call your object's methods through your TIEHANDLE implementation.

The Long Story

There are several potentially overlapping data types that can be used to perform IO.

  • Unblessed globs containing IO objects
  • Blessed globs containing IO objects
  • Blessed IO objects
  • Objects resembling IO objects (i.e. $obj->can("print")), which aren't necessarily globs
  • Tied globs

and there are two ways to use these data types, but some types only support one method:

  • method calls ($fh->print("foo"))
  • builtins (print $fh, "foo")

Lastly, there are a number of built in classes (which are not loaded by default, but are in core), such as IO::Handle and FileHandle.

When you open a standard filehandle:

use autodie;

open my $fh, "<", $filename;

the variable $fh contains an unblessed reference to a type glob. This type glob contains an IO reference in the IO slot, that is blessed into the class FileHandle by default.

The IO object is blessed even if FileHandle is not loaded.
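
You can see this structure for yourself (the exact class of the IO slot depends on what's loaded, as described above):

use strict;
use warnings;

open my $fh, "<", __FILE__ or die $!;

print ref($fh), "\n";          # GLOB -- the reference itself is not blessed
print ref( *{$fh}{IO} ), "\n"; # the IO slot is blessed, e.g. FileHandle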

If it is loaded, then both ways of using the handle are allowed:

# builtin
read $fh, my $var, 4096;

# as method
$fh->read(my $var, 4096);

These two forms behave similarly. One could even be led to believe that the first form is actually treated as indirect method syntax. This is unfortunately very far from the truth.

In the first form the read is executed directly. In the second form the read method would be invoked on the unblessed glob. However, instead of throwing the usual Can't call method "read" on unblessed reference error, Perl's method_common routine (which implements method dispatch) special cases globs with an IO slot to actually dispatch the method using the class of *{$fh}{IO}. In a way this is very similar to how autobox works.

If you remembered to use FileHandle then this should result in a successful dispatch to IO::Handle::read, which actually delegates to the read builtin:

sub read {
    @_ == 3 || @_ == 4 or croak 'usage: $io->read(BUF, LEN [, OFFSET])';
    read($_[0], $_[1], $_[2], $_[3] || 0);
}

When you create nonstandard IO objects, this breaks down:

{
    package MyHandle;

    sub read {
        my ( $self, undef, $length, $offset ) = @_;

        substr($_[1], $offset, $length) = ...;
    }
}

because now $myhandle->read(...) will work as expected, but read($myhandle, ...) will not. If it is a blessed glob the error will be read() on unopened filehandle, and for other data types the error will be Not a GLOB reference.

Tied Handles

Tied handles are very similar to built in handles: they contain an IO slot with a blessed object (this time the default is the FileHandle class, a subclass of IO::Handle that has a few additional methods, like support for seeking).

The IO object is marked as a tied data structure so that the builtin opcodes will delegate to some object implementing the TIEHANDLE interface.

A method call on a tied object will therefore invoke the method on the FileHandle class, which will delegate to the builtin, which will delegate to a method on the object.

Because of this, classes like IO::String employ a clever trick to allow the builtins to be used:

my $fh = IO::String->new($string);

# readline builtin calls getline method
while ( defined( my $line = <$fh> ) ) {
    ...
}

# print builtin calls print method
print $fh "foo";

IO::String::new ties the new handle to itself[1], and sets up the extra glue:

sub new
{
    my $class = shift;
    my $self = bless Symbol::gensym(), ref($class) || $class;
    tie *$self, $self;
    $self->open(@_);
    return $self;
}

The TIEHANDLE API is set up by aliasing symbols:

*READ   = \&read;

Effectively this reverses the way it normally works. Instead of IO::Handle methods delegating to the builtin ops, the builtin ops delegate to the IO::String methods.

IO::Handle::Util provides an io_to_glob helper function which produces a tied unblessed glob that delegates to the methods of an IO object. This function is then used to implement *{} overloading. This allows non glob handles to automatically create a working glob as necessary, without needing to implement the tie kludge manually.
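
A sketch of how that fits together, assuming io_to_glob takes the handle object as its argument:

package MyHandle;
use IO::Handle::Util qw(io_to_glob);

use overload '*{}' => sub {
    my $self = shift;

    # build (and cache) a tied glob whose TIEHANDLE methods delegate back
    # to this object, so the builtins keep working on it
    return $self->{glob} ||= io_to_glob($self);
};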

Conclusion

When working with standard or nonstandard handle types, method syntax always works (provided IO::Handle and FileHandle are loaded), but the builtin syntax only works for real or tied handles, so when using a filehandle I prefer the method syntax.

It also makes the silly print {$fh} idiom unnecessary, since direct method syntax isn't ambiguous.

Performance is a non issue, the extra overhead is nothing compared to PerlIO indirection, ties, and making the actual system calls.

However, when creating nonstandard IO objects, you should probably provide a tie fallback so that code that doesn't use method syntax will not die with strange errors (or worse, violate the encapsulation of your handle object and just work on the internal glob structure directly).

This is one of my least favourite parts in Perl. It's such a horrible cascade of kludges. In the spirit of Postel's Law, consistently using methods is the conservative thing to do (it works in all cases), and providing a tie based fallback is a way to be liberal about what you accept.

[1] Surprisingly this doesn't leak, even though weaken(tie(*$self, $self)) is not used. I suspect there is a special case that prevents the refcount increment if the object is tied to itself. See also Tie::Util

Thursday, October 1, 2009

IO::Handle::Util

My friend miyagawa has been championing PSGI and its reference implementation, Plack. This is something we've needed for a long time: a clean and simple way to respond to HTTP requests without the cruft of CGI and %ENV.

The PSGI specification requires the body of the response to be represented using an object similar to IO::Handle.

I've released IO::Handle::Util, a convenience package designed to make working with IO::Handle like objects easier.

The main package provides a number of utility functions for creating IO handles from various data structures, and for getting the data out of IO handles.

For example, if you have a string or array of strings that you would like to just pass back, you can use the io_from_any function:

my $io = io_from_any $body;

# then you can do standard operations to get the data out:
$io->getline;
$io->read(my $str, $length);

This function will sensibly coerce other things, like paths, and let already working handles pass through as is.

If you have an iterator callback that gets more data you can also use that:

my $io = io_from_getline sub {
    ...
    return $more_data; # or undef when you're done
};

This is not automated by io_from_any because you can also use a writer callback, or a callback that returns the whole handle (remember, this is not PSGI specific).

You can also go the other way, taking IO handles and getting useful things out of them. io_to_array is pretty obvious, but you can also do something like:

use autodie;

open my $fh, ">", $file;

my $cb = io_to_write_cb $fh;

$cb->("blah\n");
$cb->("orz\n");

Many of the utility functions are based on IO::Handle::Iterator and IO::Handle::Prototype::Fallback, two classes which facilitate the creation of ad hoc filehandles.

Hopefully this will make creating and working with IO handles a little quicker and easier.

Wednesday, September 23, 2009

Testing for Race Conditions

Concurrent programming is hard for many reasons. One of them is that scheduling isn't predictable; the same code may behave differently depending on the OS's scheduling decisions. This is further complicated on single CPU machines, where there is no real concurrency, just the illusion of concurrency through time sharing.

Because many operations can happen in a single time slice (usually 10ms on most operating systems), a race condition may go undetected in testing unless it happens to fall on a time slice boundary.

My laptop can do several thousand IO operations, and many more thread mutex operations, per second. If only one of those operations has a race condition, chances are it will take a very long time to stumble on it accidentally.

Since most concurrent code tries to minimize the duration of critical sections, the probability of critical sections interleaving with context switches can be very low. On a single CPU machine all the non blocking operations performed in a single time slice are atomic from the point of view of other processes. For a computer 10ms is a pretty long length of time.

Therefore, when testing for concurrency bugs it's important to introduce artificial contention to increase the chance of exposing a race condition. A fairly reliable method of doing that which I use in Directory::Transactional's test suite is introducing random delays into the process:

use Time::HiRes qw(sleep);
use autodie;

sub cede {
    sleep 0.01 if rand < 0.2;
}

for ( 1 .. $j ) {      # $j concurrent child processes
    fork and next;     # the parent moves on, the child falls through

    srand $$;          # reseed so the children don't share a random sequence

    for ( 1 .. $n ) {  # $n passes through the critical section per child
        # critical section
        cede();
        acquire_resource();
        cede();
        read_resource();
        cede();
        write_resource();
    }

    exit;
}

wait_all();            # reap all the children before checking the results

This is very easy to add only during testing by wrapping the resource access APIs in your codebase.

If you loop through the critical section concurrently for long enough, the probability of finding a race condition is much higher than without the delays, which encourage the OS to perform context switches even on a single CPU machine.

This is by no means an exhaustive method of checking all possible context switch permutations. However, by increasing the probability of interleaved context switches from infinitesimally small to pretty good for a reasonable number of iterations we are much more likely to trigger any race condition (as opposed to all of them).

Furthermore, by designing volatile test transactions the chance of detecting a race condition is also quite good (since an exposed race condition is more likely to cause a visible inconsistency). Specifically, in Directory::Transactional the test simulates money transfers between three accounts, where each iteration deducts a random amount from a random account and adds it to another.

The values are read before a delay and written afterwards, so any other process touching the same account would trigger a race condition if proper locking were not in place. The fact that the values are random also increases the chance of corruption being detectable (it's much more likely that the accounts won't balance if the amount is random rather than some constant).
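
A sketch of what one of those test transactions might look like (the read_account and write_account helpers are invented here, standing in for the real storage layer):

sub transfer {
    my ( $from, $to ) = @_;

    my $amount = 1 + int rand 100;

    # read both balances, yield to encourage a context switch, then write,
    # so that another process touching the same accounts will expose
    # missing locking as an inconsistent total
    my $from_balance = read_account($from);
    my $to_balance   = read_account($to);

    cede();

    write_account( $from, $from_balance - $amount );
    write_account( $to,   $to_balance + $amount );
}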

Coupled with concurrent consistency checks on the data done at intervals this was a pretty effective method for detecting race conditions, quickly exposing initial implementation flaws in Directory::Transactional that took very many iterations to reproduce at first. Throwing in random kill -9 signals provided a fun test for journal replay recovery.

Friday, September 18, 2009

Method Style Callbacks

This is a small trick I like to use when defining callbacks. Instead of invoking callbacks as code references:

$callback->($data);

I always write the invocation as a method call:

$data->$callback();

or if $data is not a natural invocant for the routine, then I invoke it as a method on the object that is sending the notification:

$self->$callback($data);

This works in more cases than just plain code reference invocation:

  • If $data is an object then $callback can be a string (a method name).
  • If $callback is a code reference there is no difference between this syntax and plain code dereferencing.
  • Even if $data is not an object but $callback is a code reference, the method syntax will still work.

It's pretty nice to be able to pass method names as callbacks to be invoked on a subclass, while still having the flexibility of a code reference when appropriate. This works especially well when the callback attribute has a sensible default that resolves to a method call on $self, allowing behavior to be overridden easily without needing a full subclass.
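
A sketch of that last pattern (the class, attribute and method names here are invented for illustration):

package Notifier;
use Moose;

use namespace::autoclean;

has on_change => (
    is      => "ro",
    default => "handle_change", # a method name, but a code ref works too
);

sub notify {
    my ( $self, $data ) = @_;

    my $cb = $self->on_change;
    $self->$cb($data);
}

# subclasses (or callers passing a code ref) can override this behavior
sub handle_change {
    my ( $self, $data ) = @_;
    ...;
}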

Tuesday, September 15, 2009

Hackathon Summary

During the YAPC::Asia::2009 hackathon I refactored a bunch of XS modules out of some other code, both rafl's and mine.

XS::Object::Magic

This module provides an alternative to the standard T_PTROBJ approach to creating C struct based objects in Perl.

The traditional way creates a blessed scalar reference, which contains an integer value that is cast to a pointer of the right type.

This is problematic for two reasons:

  • If the pointer value is modified (accidentally or maliciously) then this could easily corrupt memory or cause segfaults.
  • Scalar references can't be extended with more data without using inside out objects.

This module provides a C API which uses sv_magicext to create safer and more extensible objects associated with a pointer, and which interoperates nicely with Moose, amongst other things.

Magical::Hooker::Decorate

This module lets you decorate an arbitrary SV with any other arbitrary SV using the magic hook API.

This is similar to using a fieldhash to create inside out decorations, but is designed to be used primarily from XS code, and as such it provides a C API. The memory footprint is also slightly smaller since there is no auxiliary storage.

B::Hooks::XSUB::CallAsOp

This module was refactored out of Continuation::Delimited and lets an XS routine trigger code that will be invoked in the context of its caller.

This allows a little more freedom in the stack manipulations you can do (which is of course very important for continuations).

These modules are all possible due to the awesome ExtUtils::Depends module. If you find yourself cargo culting XS consider refactoring it into a reusable module using ExtUtils::Depends instead.

Saturday, September 12, 2009

Git Hate

I like Git, but sometimes I also really hate it.

My friend Sartak asked if there is a better way of finding a merge base in a command than hard coding master as the merge point, and I stupidly said "of course, you just check what that branch tracks". That is the same logic that git pull would use, so in theory you can apply the same concise logic to merge or rebase your branch without running git fetch.

In principle that's correct, but in practice it's utterly worthless.

The first step is to figure out what branch you are on.

You could easily do this with the git-current-branch command. Except that it doesn't actually exist. Instead you need to resolve the symbolic ref in HEAD and truncate the resulting string:

$(git symbolic-ref -q HEAD | sed -e 's/^refs\/heads\///')

Except that this is actually broken if the symbolic ref points at something that isn't under refs/heads.

Ignoring that, we now have a string with which we can get our merge metadata:

branch="$( git current-branch )"

git config --get "branch.$branch.remote"
git config --get "branch.$branch.merge"

The problem lies in the fact that the tracking configuration refers to the remote, and to the ref as it is named on the remote:

[branch "master"]
 remote = origin
 merge = refs/heads/master

But we want to use the local ref. Based on the remote's entry in git config, we can see that this value should be refs/remotes/origin/master:

[remote "origin"]
 url = git://example.com/example.git
 fetch = +refs/heads/*:refs/remotes/origin/*

If you read the documentation for git-fetch you can see the description of the refspec for humans, but it's pretty feature rich.

Fortunately Git already implements this, so it shouldn't be too hard. Too bad that it is.

In builtin-fetch.c there are a number of static functions that you could easily copy and paste to do this refspec evaluation in your own code.

So in short, what should have been a simple helper command has degenerated into an epic yak shave involving cargo culting C and reimplementing commands that should have been built in to begin with.

And for what?

git merge-base $( git-merge-branch $( git current-branch ) ) HEAD
instead of
git merge-base master HEAD

In short, it's completely non-viable to do the right thing; nobody has that much spare time. Instead people just kludge it. These kludges later come back with a vengeance when you assume something does the right thing, and it actually doesn't handle an edge case you didn't think about before you invoked that command.

I'm not trying to propose a solution, it's been obvious what the solution is for years (a proper libgit). What makes me sad is that this is still a problem today. I just can't fathom any reason that would justify not having a proper way to do this that outweighs the benefit of having one.

Friday, September 11, 2009

YAPC::Tiny, YAPC::Asia

I'd like to thank everyone who made my trip to Asia so great thus far.

To list but a few: thanks to the JPA for organizing YAPC::Asia, especially Daisuke for being so accommodating when it wasn't clear if I would make it till the end, gugod for organizing YAPC::Tiny, Kenichi Ishigaki and Kato Atsushi for translating my talks (though only one was presented in the end), clkao and audreyt for being such great hosts in Taiwan, Dan Kogai for hosting us at Hotel Dan, and the Japanese and Taiwanese Perl communities in general for being so much fun to be around.

I'm so indebted to everyone involved in this conference and all of the others, it makes every one of these trips fun and memorable, totally justifying the fact that Perl takes up such an unhealthily large part of my life and income.

Tomorrow is the hackathon, always my favourite part of conferences. On my agenda:

  • Hacking with gfx and rafl on Moose's XS stuff
  • Fixing a weird corruption that happens with Continuation::Delimited, so that finally it passes all of its tests. Once that happens I can focus on adding new failing tests, which is much more fun ;-)
  • Playing with PSGI/Plack, Stardust, and other cool things I learned about during the conference and will probably discover tomorrow

On Sunday, the last day of my trip, CL and I are going to try to not get killed on Mount Myogi.

Saturday, September 5, 2009

Cache Assertions

Caching an expensive computation is a great way to improve performance, but when it goes wrong the bugs can be very subtle. It's vital to be sure that the cached result is, for all intents and purposes, the same as what the computation would have produced.

A technique I often employ is using a thunk to defer the computation:

$cache_helper->get_value( $key, sub {
   ... produce the expensive result here ...
});

If there is a cache hit then the subroutine is simply not executed:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    if ( my $cache_hit = $self->cache->get($key) ) {
        return $cache_hit;
    } else {
        my $value = $thunk->();

        $self->cache->set( $key => $value );

        return $value;
    }
}

The get_value method must be referentially transparent; when invoked with the same $key it should always produce a consistent result regardless of the state of the cache.

When adding caching to existing code I often accidentally choose a cache key that doesn't contain all the implied parameters since the details of the code I'm caching are no longer fresh in my memory.

The reason I like the thunk approach is that $cache_helper can be swapped for something else to easily check for such caching bugs.

For instance, to confirm that a reproducible problem is a cache related bug, you can disable caching temporarily using this implementation:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    return $thunk->();
}

Or better yet, use an asserting helper in your test environment, where performance isn't important:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    my $value = $thunk->();

    if ( my $cache_hit = $self->cache->get($key) ) {
        $self->check_cache_hit($cache_hit, $value);

        # it's important to return $cache_hit in case check_cache_hit
        # isn't thorough enough, so that this doesn't mask any bugs
        # that would manifest only in production
        return $cache_hit;
    } else {
        $self->cache->set( $key => $value );

        return $value;
    }
}

sub check_cache_hit {
    my ( $self, $cache_hit, $value ) = @_;

    confess "Bad cache hit: $cache_hit != $value"
        unless ...;
}

This approach can be used on any sort of cache, from $hash{$key} ||= $thunk->() to CHI.

It's useful to have several get_value like methods, to handle the different result types more easily (check_cache_hit can be a little intricate, and adding namespacing to keys is always a good idea).

This is useful for other reasons too, for instance to collect timing information and measure the effectiveness of caching, or to use different cache storage for different result types.

The key to using this effectively is to make the caching object an easily overridable delegate in your application's code. For instance, if you're using Catalyst you can specify the subclass in your configuration, or even make it conditional on $c->debug.

Lastly, it's worth mentioning that my sample implementation is actually wrong if false values are valid results.
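
One way to fix that (a sketch, assuming the cache backend can store references) is to wrap the cached values, so that a miss (undef) is distinguishable from a cached false value:

sub get_value {
    my ( $self, $key, $thunk ) = @_;

    if ( my $cache_hit = $self->cache->get($key) ) {
        return $cache_hit->[0]; # unwrap
    } else {
        my $value = $thunk->();

        $self->cache->set( $key => [ $value ] );

        return $value;
    }
}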

Tuesday, September 1, 2009

Try::Tiny

I just released Try::Tiny, yet another try { } catch { } module.

The rationale behind this module is that Perl's eval builtin requires large amounts of intricate boilerplate in order to be used correctly. Here are the problems the boilerplate must address:

  • $@ should be localized to avoid clearing previous values if the eval succeeded.
  • This localization must be done carefully, or the errors you throw with die might also be clobbered.
  • if ( $@ ) { ... } is not guaranteed to detect errors.

The first problem causes action at a distance, so if you don't address it your code is very impolite to others. The other two problems reduce the reliability of your code. The documentation contains an in depth explanation of all of these issues.

Here is my standard boilerplate for a polite, defensive eval block:

my ( $error, $failed );

{
   local $@;

   $failed = not eval {

       ...;

       return 1;
   };

   $error = $@;
}

if ( $failed ) {
   warn "got error: $error";
}

This is extremely tedious when really all I want to do is protect against potential errors in the ... part.
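
With Try::Tiny the same thing reads the way you'd expect, with the error available in $_ inside the catch block:

use Try::Tiny;

try {
    ...;
} catch {
    warn "got error: $_";
};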

Try::Tiny does that and little else; it runs on older Perls, it works with the various exception modules from the CPAN, it has no dependencies, it doesn't invent a new catch syntax, and it doesn't rely on any mind boggling internals hacks.

If you are not comfortable using TryCatch and you don't mind missing out on all of its awesome features then you should use Try::Tiny.

If you think plain eval { } is fine then your code is potentially harmful to others, so you should also use Try::Tiny.

Despite its minimalism Try::Tiny can be pretty expressive, since it integrates well with Perl 5.10's switch statement support:

use 5.010; # or use feature 'switch'

try {
    require Foo;
} catch {
    when ( /^Can't locate Foo\.pm in \@INC/ ) { } # ignore
    default { die $_ } # all other errors are fatal
};

For backwards compatibility you can obviously just inspect $_ manually, without using the when keyword.

Friday, August 28, 2009

git rebase for the Impatient

Let's say you created a branch from the commit foo, and then committed the change bar:

While you were working on that somebody committed baz and pushed it to the upstream master:

Here is the graph of the divergence:

You now have two choices.

If you use git pull that will automatically merge master into your branch:

If you use git pull --rebase, it applies your bar commit on top of the upstream, avoiding the extra merge commit and creating a linear history:

If you're used to subversion, this is very similar to what running svn up before svn commit does.

At this point you can push your work, and no evil merge will "break" your branch. Using git pull --rebase is a safe and natural fit when using a shared git repository.

Note that rebase recreates those commits: there are now two commits called bar, one whose parent is baz, and one whose parent is foo; the latter is still available in the branch's reflog (git reflog branch):

The above images are screenshots from the lovely GitX. They are actually lies; I created branches called rebase, merge and ancestor for clarity. gitk and git log --decorate both provide similar visualizations, with decreasing levels of shiny.

You can read more on why I think rebase is awesome or make git pull use --rebase by default.

Lastly, if you want to learn more, try reading Git from the Bottom Up, Git for Computer Scientists, and the Pro Git book's chapter on rebasing, amongst other things.

Moose Triggers

Moose lets you provide a callback that is invoked when an attribute is changed.

Here is an example of "height" and "width" attributes automatically updating an "area" attribute when either of them is modified:

has [qw(height width)] => (
    isa     => "Num",
    is      => "rw",
    trigger => sub {
        my $self = shift;
        $self->clear_area();
    },
);

has area => (
    isa        => "Num",
    is         => "ro",
    lazy_build => 1,
);

sub _build_area {
    my $self = shift;
    return $self->height * $self->width;
}

The Problem

Unfortunately the implementation is sort of weird. To quote the documentation:

NOTE: Triggers will only fire when you assign to the attribute, either in the constructor, or using the writer. Default and built values will not cause the trigger to be fired.

This is a side effect of the original implementation, one that we can no longer change due to compatibility concerns.

Keeping complicated mutable state synchronized is difficult enough as it is; doing so on top of confusing semantics makes it even worse. There are many subtle bugs that you could accidentally introduce, and they become very hard to track down.

In #moose's opinion (and mine too, obviously), anything but trivial triggers that just invoke a clearer should be considered a code smell.

Alternatives

If you find yourself reaching for triggers there are a number of other things you can do.

Usually the best way is to avoid mutable state altogether. This has a number of other advantages (see older posts: 1, 2, 3). If you can eliminate mutable state you won't need triggers at all. If you merely reduce mutable state you can still minimize the number and complexity of triggers.

The general approach is to take all of those co-dependent attributes and move them into a separate class. To update the data an instance of that class is constructed from scratch, replacing the old one.
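
Here is a sketch of that approach using the earlier example (the class names are made up):

package Dimensions;
use Moose;

use namespace::autoclean;

has [qw(height width)] => ( isa => "Num", is => "ro", required => 1 );

has area => ( isa => "Num", is => "ro", lazy_build => 1 );

sub _build_area {
    my $self = shift;
    return $self->height * $self->width;
}

package Shape;
use Moose;

use namespace::autoclean;

has dimensions => (
    isa     => "Dimensions",
    is      => "rw",
    handles => [qw(height width area)],
);

# instead of mutating height or width in place, build a fresh
# Dimensions object and swap it in
sub resize {
    my ( $self, %args ) = @_;
    $self->dimensions( Dimensions->new(%args) );
}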

Another approach is to just remove the notion of an attribute from your class's API. Make the attributes that store the mutable data completely private (don't forget to set their init_arg to undef), and in their place provide a method, preferably one with a clear verb in its name, to modify them in a more structured manner.

Instead of implicitly connecting the attributes using callbacks they are all mutated together in one well defined point in the implementation.

The method can then be used in BUILD allowing you to remove any other code paths that set the attributes, so that you only have one test vector to worry about.
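
And a sketch of this second approach (again with invented names), where set_size is the single well defined place where the co-dependent attributes change together:

package Rectangle;
use Moose;

use namespace::autoclean;

has [qw(_height _width _area)] => (
    isa      => "Num",
    is       => "rw",
    init_arg => undef,
);

sub BUILD {
    my ( $self, $args ) = @_;
    $self->set_size( $args->{height}, $args->{width} );
}

sub set_size {
    my ( $self, $height, $width ) = @_;

    $self->_height($height);
    $self->_width($width);
    $self->_area( $height * $width );
}

# public read-only view of the derived value
sub area { (shift)->_area }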

The code that Moose generates for you should be helpful, don't let it get in the way of producing clean, robust and maintainable code.

Monday, August 24, 2009

Abstracting Ambiguity

The Quantum::Superpositions module provides a facility for ambiguous computation in Perl. This idea has been added to Perl 6 under the name "Junctions". This data type lets you treat a list of alternatives for a value as though that list were a single scalar.

Implementing a backtracking search is more work than a simple result validator. Backtracking complicates control flow and requires you to manage those lists of alternatives[1]. The added effort pays off in efficiency compared to an exhaustive search built on top of a simple validator, which only needs to worry about a single value for each parameter at any given point.

Junctions work by overloading a single datum to behave as if it were multiple values simultaneously. They let you pretend you're checking a single result, when you're actually checking many possible results at the same time.

If you've played a bit with Perl 6 you probably know how much fun it is to use junctions to simplify managing those lists of alternative values.

My friend awwaiid recently introduced me to the amb operator invented by John McCarthy. It reads a little like the any junction constructor:

my $x = amb(1, 2, 3);
my $y = amb(1, 2, 3);

amb_assert($x >= 2);
amb_assert($x + $y == 5);
amb_assert($y <= 2);

say "$x + $y = 5"; # prints "3 + 2 = 5"

But this is where the resemblance ends[2]. Instead of overloading the data stored in $x and $y, amb overloads the control flow of the expressions that assign to $x and $y.

When amb is invoked it captures a continuation, and then invokes it with the first value it was given (this causes that value to be returned from amb). Execution then resumes normally; in the first expression $x is assigned the value 1 because that's what the continuation was invoked with.

Eventually the first assertion is evaluated and fails because $x >= 2 is false. At this point we backtrack to the last amb (kind of like using die). Since the continuation that assigns the value to $x has been reified we can use it again with the next value, 2.

What amb is doing is injecting the control flow required for backtracking into the stack, adding retry logic around each choice point. Every amb expression is an implicit backtracking marker.

The above example is taken from the Continuation::Delimited test suite. I think this is a very nice example of how continuations (delimited or otherwise) can be used to create powerful but simple constructs. By treating program flow as first class data we can create abstractions for the patterns we find in it.

Delimited continuations allow you to wrap, discard and replicate pieces of the program's flow; these are all simple manipulations you can do to first class data. While traditional first class functions let you do these things to operations that haven't happened yet, with continuations you can manipulate operations that are happening right now, enabling the expressiveness of higher order functions to be applied in very interesting ways.

The real beauty of continuations is that the program's flow is still written in exactly the same way as before. Continuations are captured and reified from normal, unaltered code, which is why they are such a general purpose tool.

[1] It gets even harder if mutable state is involved, since backtracking would require rolling that state back before retrying. Yet another reason to prefer immutability.

[2] This example can't be translated directly to junctions. The predicate checks don't (and can't) modify the junctions, so they do not compose with each other implicitly.

Although junctions only provide a limited aspect of nondeterministic programming, that doesn't mean they are less useful than amb; they're just different. When collecting permuted results is what you really want, then overloading the data (as opposed to the control flow) is a more direct approach.

The expression 1|2|3 + 1|2|3 == 5 evaluates to true because at least one of the permuted sums is equal to 5. Although it can't tell you which of the summands add up to 5, the ability to get a junction of the sums (1|2|3 + 1|2|3) is useful in its own right.
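
Spelled out in plain Perl 5, that expression computes something like the following (an illustrative sketch of the semantics, not of how junctions are implemented):

my @xs = ( 1, 2, 3 );
my @ys = ( 1, 2, 3 );

# all nine permuted sums
my @sums = map { my $x = $_; map { $x + $_ } @ys } @xs;

# true if at least one of them equals 5
my $matched = grep { $_ == 5 } @sums;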