Wednesday, November 18, 2009

Functional programming and unreasonable expectations

<record type="broken">I'm a big fan of purely functional programming</record>.

Another reason I like it so much is that purely functional software tends to be more reliable. Joe Armstrong of Erlang fame makes that point in an excellent talk much better than I could ever hope to.

However, one aspect he doesn't really highlight is that reliability is not only good for keeping your system running, it also makes it easier to program.

When a function is pure it is guaranteed to be isolated from other parts of the program. This separation is makes it much easier to change the code in one place without breaking anything unrelated.

Embracing this style of programming has had one huge drawback though: it utterly ruined my expectations of non functional code.

In imperative languages it's all too easy to add unstated assumptions about global state. When violated, these assumptions then manifest in very ugly and surprising ways (typically data corruption).

A good example is reentrancy (or rather the lack thereof) in old style C code. Reentrant code can be freely used in multiple threads, from inside signal handlers, etc. Conversely, non-reentrant routines may only be executed once at a given point in time. Lack of foresight in early C code meant that lots of code had to be converted to be reentrant later on. Since unstated assumptions are by definition hidden this can be a difficult and error prone task.

The specific disappointment that triggered this post is Perl's regular expression engine.

Let's say we're parsing some digits from a string and we want to create a SomeObject with those digits. Easy peasy:

$string =~ m/(\d+)/;
push @results, SomeObject->new( value => $1 );

Encapsulating that match into a resuable regex is a little harder though. Where does the post processing code go? Which capture variable does it use? Isolation would have been nice. The following example might work, but it's totally wrong:

my $match_digits = qr/(\d+)/;

my $other_match = qr{ ... $match_digits ... }x;

$string =~ $other_match;
push @results, SomeObject->new( value => $1 ); # FIXME makes no sense

Fortunately Perl's regex engine has a pretty awesome feature that let you run code during a match. This is very useful for constructing data from intermittent match results without having to think about nested captures, especially since the $^N variable conveniently contains the result of the last capture.

Not worrying about nested captures is important when you're combining arbitrary patterns into larger ones. There's no reliable way to know where the capture result ends up so it's easiest to process it as soon as it's available.

qr{
    (\d+) # match some digits

    (?{
        # use the previous capture to produce a more useful result
        my $obj = SomeObject->new( value => $^N );

        # local allows backtracking to undo the effects of this block
        # this would have been much simpler if there was a purely
        # functional way to accumulate arbitrary values from regexes
        local @results = @results, $obj;
    })
}x;

Even though this is pretty finicky it still goes a long way. With this feature you can create regexes that also encapsulate the necessary post processing, while still remaining reusable.

Here is a hypothetical the definition of SomeObject:

package SomeObject;
use Moose;

has value => (
    isa => "Int",
    is  => "ro",
);

Constructing SomeObject is a purely functional operation: it has no side effects, and only returns a new object.

The only problem is that the above code is totally broken. It works, but only some of the time. The breakage is pretty random.

Did you spot the bug yet? No? But it's oh so obvious! Look inside Moose::Util::TypeConstraints::OptimizedConstraints and you will find the offending code:

sub Int { defined($_[0]) && !ref($_[0]) && $_[0] =~ /^-?[0-9]+$/ }

The constructor Moose generated for SomeObject is in fact not purely functional at all; though seemingly well behaved, in addition to returning an object it also the side effect of shitting all over the regexp engine's internal data structures, causing random values to be occasionally assigned to $^N (but only if invoked from inside a (?{ }) block during a match). You can probably imagine what a great time I had finding that bug.

What makes me sad is that the Int validation routine appears purely functional. It takes a value and then without modifying anything merely checks that it's defined, that it's not a reference, and that its stringified form contains only digits, returning a truth value as a result. All of the inputs and all of the outputs are clear, and therefore it seems only logical that this should be freely reusable.

When I came crying to #p5p it turned out that this is actually a known issue. I guess I simply shouldn't have expected the regexp engine to do such things, after all it has a very long history and these sorts of problems are somewhat typical of C code.

If the regexp engine was reentrant the what I tried to do would have just worked. Reentrancy guarantees one level of arbitrary combinations of code (the bit of reentrant code can be arbitrarily combined with itself). Unfortunately it seems very few people are actually in a position to fix it.

Purely functional code goes one step further. You can reliably mix and match any bit of code with any other bit of code, combining them in new ways, never having to expect failure. The price you have to pay is moving many more parameters around, but this is exactly what is necessary to make the boundaries well defined: all interaction between components is explicit.

When old code gets reused it will inevitably get prodded in ways that the original author did not think of. Functional code has a much better chance of not needing to be reimplemented, because the implementation is kept isolated from the usage context.

In short, every time you write dysfunctional code god kills a code reuse. Please, think of the code reuse!

14 comments:

Anonymous said...

I ran into this exact same problem two weeks ago, though not in moose - I was trying to be clever and got bitten by my *own* code.

This sort of issue is a good illustration of your point about functional and re-entrant code... though I hope the average reader sees that and not "omg the perl regex engine is bad!!1!" To be fair, I would want to point out that the regex docs are pretty clear that the (?{}) and (??{}) constructs are dangerous and experimental... and non-reentrant! :-)

jjore said...

In a former life I wrote Regexp::Approx at http://perlmonks.org/?node_id=306474 which forks off another interpreter to run reentrant regexp code. :-(

nothingmuch said...

my solution was pushing [ $thunk_coderef, $^N ] and then applying all of those after the match finished

zby said...

This is probably a pie in the sky - but wouldn't it be nice if it was possible to declare a sub as purely functional? Then the parser would check if there is any call to anything unsafe. It would have to be over-restrictive and not all purely functional code could be declared as such - but I think it could work for a class of subs. In a way it would work for subs as taint works for vars.

Jon Harrop said...

"Another reason I like it so much is that purely functional software tends to be more reliable. Joe Armstrong of Erlang fame makes that point in an excellent talk much better than I could ever hope to."

On the contrary, Erlang is reliable but not purely functional and, in fact, the only software ever written entirely in a purely functional language (Darcs, written in Haskell) and that garnered a significant user base (of only a few thousand users) was subsequently dropped by those users precisely because it was plagued with reliability (and performance) problems.

In reality, trying to use purely functional programming to solve many real problems introduces more bugs than it catches.

Unknown said...

dunno why i'm bothering to respond to a typical jdh troll but here it goes -- erlang is indeed purely functional (i.e. there is no mutation) and the problems with darcs (which is still quite popular, actually, and actively maintained, and whose standout problems have been fixed) were with regards to the time taken by certain algorithms, not with regards to anything else.

And jdh has absolutely no grounding for his last statement at all -- one can write perfectly nice purely functional code in ocaml or f# or what have you, and its just plain silly to say that avoiding side-effects somehow can "introduce bugs".

nothingmuch said...

Purely functional does not imply that the language restricts to that, I write purely functional Perl every day, and it's about as impure as you can get.

The point of the post is that purely functional components, regardles of whether or not they are written in a purely functional language are in generall more composable and reusable.

Ironically, while Git is implemented mostly in C it uses a largely purely functional model (everything but references is immutable), whereas darcs initially chose a mutable data model, which was part of the issues that plagued version 1. Git (at it's core) is much more pure than darcs.

nothingmuch said...

@zby - statically inferring that a subroutine is purely functional is trivial for the simple case - just prove it has no impure opcodes.

The problem is knowing which opcodes are pure and which are impure.

For example it turns out that matching a regex is impure because it's nonreentrant, but I never would have expected that since all of the input parameters are provided for it lexically in the Int validator.

zby said...

OK - so for the simple cases it is trivial - we could have a list of pure and impure operations and then add to the list of impure all subs that use anything impure. If the editor highlighted for you the impure operations in a different colour - then you'd not make your mistake. The question is if this is enough to be useful, if your particular case is common enough.

nothingmuch said...

Well, what would you do with it?

You can walk the optree and unless all of the ops are known pure (or have known pure implications in the context, such as an entersub whose subroutine child op is refers to a GV that contains a CV that is also known pure) return false. This check can be improved incrementally.

Once you have that you can write a B::Util::CheckPurity or something, which can check for the purity of a given subroutine easily.

With that you could then do interesting things, but the question is what =)

Some ideas:

1. Memoize::Automatically
2. a pragma to assert that all the subroutines declared in a certain scope must be pure unless otherwise excluded
3. debugging aids maybe?

...

Jon Harrop said...

@Sterling: Purely functional means side-effect free and not mutation free. Erlang allows uncontrolled side effects (message passing) and, therefore, is not purely functional.

Bugs introduced by purely functional programming typically fall into three categories: stack overflows, unusable performance and unusable memory consumption. The former is a consequence of avoiding loops and the latter is a consequence of laziness (which renders performance and memory consumption unpredictable).

nothingmuch said...

You are assuming that receiving a message is an operation with side effects. In the context of the blog post it would really be more appropriate to rail on Perl.

zby said...

Rewording what I said at IRC. The benefit of using a functional language is that you get perfect isolation, this is because there is no state. But with imperative languages you can also write code that is purely functional - you just need to isolate the state. This means you use only subs that don't change global Perl state and don't use any variables from outer scope. But it does not mean that they use only other pure functions. Here is an example:

package Adder;
use Moose;

has number => ( is => 'rw' );

sub add {
my ( $self, $value ) = @_;
return $value + $self->number;
}

package PureFunctional;

sub add_four {
my $value = shift;
my $adder = Adder->new;
$adder->number( 4 );
return $adder->add( $value );
}

The 'add' method is not purely functional because it changes the state of the object which is in the outside scope - but the add_four sub is purely functional because that object is an automatic variable of that sub.

The point is that if we had that hypothetical 'use pure' pragma we would have a language that gives you the convenience of imperative programming together with the safety of functional programming.

zby said...

It should have been "The 'number' method is not purely functional because it changes the state of the object which is in the outside scope"