Friday, August 28, 2009

git rebase for the Impatient

Let's say you created a branch from the commit foo, and then committed the change bar:

While you were working on that somebody committed baz and pushed it to the upstream master:

Here is the graph of the divergence:

You now have two choices.

If you use git pull that will automatically merge master into your branch:

If you use git pull --rebase that will apply your bar commit on top of the upstream, avoiding the extra merge commit and creating a linear history:
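
For concreteness, here is roughly what the two options look like on the command line (the branch and remote names are hypothetical, and foo, bar and baz stand for the commits above):

git checkout -b topic foo          # create your branch at commit foo
# hack, commit bar; meanwhile baz lands on the upstream master

git pull origin master             # option 1: merge, adds a merge commit
git pull --rebase origin master    # option 2: rebase, replays bar on top of baz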

If you're used to subversion, this is very similar to what running svn up before svn commit does.

At this point you can push your work, and no evil merge will "break" your branch. Using git pull --rebase is a safe and natural fit when using a shared git repository.

Note that rebase will recreate those commits: there are two commits called bar, one whose parent is baz, and one whose parent is foo, which is still available in the branch's git reflog:

The above images are screenshots from the lovely GitX. They're actually lies: I created branches called rebase, merge and ancestor for clarity. gitk and git log --decorate both provide similar visualizations with decreasing levels of shiny.

You can read more on why I think rebase is awesome or make git pull use --rebase by default.
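
For reference, a couple of config knobs can make rebase the default (pull.rebase only exists in newer versions of git; older versions use the branch.autosetuprebase and branch.<name>.rebase settings):

git config --global pull.rebase true                 # every git pull rebases instead of merging
git config --global branch.autosetuprebase always    # older git: new branches get per-branch rebase config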

Lastly, if you want to learn more, try reading Git from the Bottom Up, Git for Computer Scientists, and the Pro Git book's chapter on rebasing, amongst other things.

Moose Triggers

Moose lets you provide a callback that is invoked when an attribute is changed.

Here is an example of "height" and "width" attributes automatically keeping an "area" attribute up to date (by clearing it so that it is lazily recomputed) when either of them is modified:

package Rectangle;    # hypothetical class name for this example
use Moose;

has [qw(height width)] => (
    isa     => "Num",
    is      => "rw",
    # invalidate the cached area whenever either dimension is assigned to
    trigger => sub {
        my $self = shift;
        $self->clear_area();
    },
);

has area => (
    isa        => "Num",
    is         => "ro",
    lazy_build => 1,
);

sub _build_area {
    my $self = shift;
    return $self->height * $self->width;
}
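
A quick usage sketch (Rectangle being the hypothetical class name from the snippet above):

my $r = Rectangle->new( height => 2, width => 3 );
print $r->area, "\n";    # 6, computed lazily on first access
$r->height(5);           # the trigger fires and clears the cached area
print $r->area, "\n";    # 15, rebuilt from the new height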

The Problem

Unfortunately the implementation is sort of weird. To quote the documentation:

NOTE: Triggers will only fire when you assign to the attribute, either in the constructor, or using the writer. Default and built values will not cause the trigger to be fired.

This is a side effect of the original implementation, one that we can no longer change due to compatibility concerns.
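
To see the gotcha in isolation, imagine an attribute declared with a default and a noisy trigger (a contrived variation on the class above):

has height => (
    isa     => "Num",
    is      => "rw",
    default => 10,
    trigger => sub { warn "height changed\n" },
);

Rectangle->new;                   # the default of 10 is used, the trigger does NOT fire
Rectangle->new( height => 10 );   # assignment in the constructor, the trigger fires
Rectangle->new->height(10);       # assignment via the writer, the trigger fires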

Keeping complicated mutable state synchronized is difficult enough as it is; doing that on top of confusing semantics makes it worse. There are many subtle bugs that you could accidentally introduce, and they become very hard to track down.

In #moose's opinion (and mine too, obviously), anything but trivial triggers that just invoke a clearer should be considered a code smell.

Alternatives

If you find yourself reaching for triggers there are a number of other things you can do.

Usually the best way is to avoid mutable state altogether. This has a number of other advantages (see older posts: 1, 2, 3). If you can eliminate mutable state you won't need triggers at all. If you merely reduce mutable state you can still minimize the number and complexity of triggers.

The general approach is to take all of those co-dependent attributes and move them into a separate class. To update the data an instance of that class is constructed from scratch, replacing the old one.
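
A minimal sketch of that refactoring (the class and attribute names are hypothetical):

package Dimensions;
use Moose;

# everything is read only, so nothing can drift out of sync
has [qw(height width)] => ( isa => "Num", is => "ro", required => 1 );

has area => ( isa => "Num", is => "ro", lazy_build => 1 );

sub _build_area {
    my $self = shift;
    return $self->height * $self->width;
}

# "updating" means replacing the whole value object with a fresh one
my $old     = Dimensions->new( height => 2, width => 3 );
my $resized = Dimensions->new( height => $old->height, width => 10 );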

Another approach is to just remove the notion of an attribute from your class's API. Make the attributes that store the mutable data completely private (don't forget to set their init_arg to undef), and in their place provide a method, preferably one with a clear verb in its name, to modify them in a more structured manner.

Instead of implicitly connecting the attributes using callbacks they are all mutated together in one well defined point in the implementation.

The method can then be used in BUILD allowing you to remove any other code paths that set the attributes, so that you only have one test vector to worry about.
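
Here is a rough sketch of that style (the class is hypothetical; the mutable attribute is private with init_arg disabled, and the single verb-named method is the only code path that changes it, including from BUILD):

package Counter;
use Moose;

has _count => (
    isa      => "Int",
    is       => "rw",
    init_arg => undef,    # cannot be set from the constructor
    default  => 0,
);

has history => ( isa => "ArrayRef", is => "ro", default => sub { [] } );

# the one sanctioned way to mutate the object; related state changes together
sub advance_by {
    my ( $self, $n ) = @_;
    $self->_count( $self->_count + $n );
    push @{ $self->history }, $n;
}

sub BUILD {
    my ( $self, $args ) = @_;
    $self->advance_by( $args->{start} ) if exists $args->{start};
}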

The code that Moose generates for you should be helpful; don't let it get in the way of producing clean, robust and maintainable code.

Monday, August 24, 2009

Abstracting Ambiguity

The Quantum::Superpositions module provides a facility for ambiguous computation in Perl. This idea has been added to Perl 6 under the name "Junctions". This data type lets you treat a list of alternatives for a value as though that list were a single scalar.

Implementing a backtracking search is more work than implementing a simple result validator. Backtracking complicates control flow and requires you to manage those lists of alternatives[1]. The added effort pays off in efficiency compared to an exhaustive search built on top of a simple validator, which only needs to worry about a single value for each parameter at any given point.

Junctions work by overloading a single datum to behave as if it were multiple values simultaneously. They let you pretend you're checking a single result, when you're actually checking many possible results at the same time.
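
In Perl 5 you can get a taste of this with Quantum::Superpositions; here is a rough sketch (treat the exact overloading details as approximate):

use Quantum::Superpositions;

# one boolean check stands in for testing all nine permuted sums
if ( any(1, 2, 3) + any(1, 2, 3) == 5 ) {
    print "at least one pair of summands adds up to 5\n";
}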

If you've played a bit with Perl 6 you probably know how much fun it is to use junctions to simplify managing those lists of alternative values.

My friend awwaiid recently introduced me to the amb operator invented by John McCarthy. It reads a little like the any junction constructor:

my $x = amb(1, 2, 3);
my $y = amb(1, 2, 3);

amb_assert($x >= 2);
amb_assert($x + $y == 5);
amb_assert($y <= 2);

say "$x + $y = 5"; # prints 3 + 2 == 5

But this is where the resemblance ends[2]. Instead of overloading the data stored in $x and $y, amb overloads the control flow of the expressions that assign to $x and $y.

When amb is invoked it captures a continuation, and then invokes it with the first value it was given (this causes that value to be returned from amb). Execution then resumes normally; in the first expression $x is assigned the value 1 because that's what the continuation was invoked with.

Eventually the first assertion is evaluated and fails because $x >= 2 is false. At this point we backtrack to the last amb (kind of like using die). Since the continuation that assigns the value to $x has been reified we can use it again with the next value, 2.

What amb is doing is injecting the control flow required for backtracking into the stack, adding retry logic at each choice point. Every amb expression is an implicit backtracking marker.
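
For contrast, here is a naive Perl 5 stand-in that finds the same answer with plain recursion instead of continuations; unlike amb it defers all the assertions until a complete assignment has been chosen, rather than backtracking at each one (amb_solve and its arguments are made up for this sketch):

use strict;
use warnings;

# try every combination of alternatives depth-first, returning the first
# assignment that satisfies the predicate
sub amb_solve {
    my ( $alternatives, $check ) = @_;
    my @chosen;

    my $try;
    $try = sub {
        my $depth = shift;

        return $check->(@chosen) ? [@chosen] : undef
            if $depth == @$alternatives;

        for my $candidate ( @{ $alternatives->[$depth] } ) {
            $chosen[$depth] = $candidate;
            my $found = $try->( $depth + 1 );
            return $found if $found;
        }

        return undef;
    };

    return $try->(0);
}

my $solution = amb_solve(
    [ [ 1, 2, 3 ], [ 1, 2, 3 ] ],
    sub { my ( $x, $y ) = @_; $x >= 2 && $x + $y == 5 && $y <= 2 },
);

printf "%d + %d == 5\n", @$solution;    # prints 3 + 2 == 5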

The amb example above is taken from the Continuation::Delimited test suite. I think this is a very nice example of how continuations (delimited or otherwise) can be used to create powerful but simple constructs. By treating program flow as first class data we can create abstractions for the patterns we find in it.

Delimited continuations allow you to wrap, discard and replicate pieces of the program's flow; these are all simple manipulations you can do to first class data. While traditional first class functions let you do these things to operations that haven't happened yet, with continuations you can manipulate operations that are happening right now, enabling the expressiveness of higher order functions to be applied in very interesting ways.

The real beauty of continuations is that the program's flow is still written in exactly the same way as before. Continuations are captured and reified from normal, unaltered code, which is why they are such a general purpose tool.

[1] It gets even harder if mutable state is involved, since backtracking would require rolling that state back before retrying. Yet another reason to prefer immutability.

[2] This example can't be translated directly to junctions. The predicate checks don't (and can't) modify the junctions, so they do not compose with each other implicitly.

Although junctions only provide a limited aspect of nondeterministic programming, that doesn't mean they are less useful than amb; they're just different. When collecting permuted results is what you really want then overloading the data (as opposed to the control flow) is a more direct approach.

The expression 1|2|3 + 1|2|3 == 5 evaluates to true because at least one of the permuted sums is equal to 5. Although it can't tell you which of the summands amount to 5, the ability to get a junction for the sums (1|2|3 + 1|2|3) is useful in its own right.

Monday, August 17, 2009

Gospel of Consistency

MySQL is configured for data loss out of the box[1]. Putting stuff "in the cloud" won't make it magically scalable; without understanding and overcoming the constraints imposed by cloud computing your app will likely not work at all. When CouchDB 0.9 removed multi document transactions I was confused about the technical reasons, but much of the feedback I received seemed to misunderstand my point.

It seems that consistency, especially in a distributed environment, is a thoroughly misunderstood concept.

Consistent data is data that makes sense. Much like you can't just pull correct data out of your ass, you can't put inconsistent data back in there and "eventually" pull out something consistent. Corrupt data may not be immediately obvious, which is why consistency requires methodology.

Taking a guess at the reasons behind MySQL's utterly broken default, I suspect that in this case it's not even an optimization. To a person who does not understand concurrency silent data loss is not immediately apparent, but having your transaction fail to commit is very scary if you don't understand what's going on.

The Meaning of Inconsistency

If you want to have your car fixed, you take it to a garage. A transaction will take place, which involves you paying money and the mechanic fixing your car.

An inconsistent state would be either you paying money and the problem not being fixed, or the mechanic fixing your car and not getting compensated. A rolled back transaction puts you back where you started, with a broken car and some money to fix it.

Suppose you went to a mechanic who took your money and then discovered that a necessary part is not in stock; that's a simple error to resolve. This is an expected failure.

If on the other hand the part installation requires a specialized tool she might not have, this error is a little more complicated, but can be easily detected in advance.

Lastly, if a dinosaur attacks the garage and eats your car, that's an even more complicated failure mode. Thankfully dinosaur attacks aren't very common.

Here is a coarse grained transactional protocol for paying money to get a car fixed:

  1. Client arrives with cash, a safe (that has no value) and a broken car
  2. Mechanic estimates a quote
  3. Client places cash in the safe, retaining the key (if there is insufficient cash the transaction is rolled back)
  4. Mechanic keeps safe in garage, and proceeds to work, as client waits in the lobby. If some error occurs (e.g. the car turns out to be unfixable), the transaction is rolled back
  5. Client inspects work
  6. Client gives key to mechanic, who extracts the cash

Note that this protocol doesn't protect against dinosaur errors (much like enabling fsync in your database doesn't protect against disk failure), but does protect against the other failure modes by ensuring there are sufficient funds, and that the mechanic does not get access to payment until the job is completed successfully.

The point of this example is to demonstrate how cumbersome full ACID could be for a process that is generally expected to go smoothly. The cost of lugging around a safe in order to ensure your mechanic doesn't screw you over is unreasonable. If you paid and didn't get your money's worth you use a conflict resolution tool, such as a lawyer.

If every trip to the supermarket would require dragging a safe to escrow your money the world economy would grind to a halt.

Having a scalable system means balancing out the risk and cost of failures (expected and unexpected) vs. the cost of protecting against them. The garage would probably incur a small penalty for a delay (for instance giving you a small discount). They would also probably pay insurance against unlikely but foreseeable events, like a fire. Investing in a laser defense system is not that sensible.

What it boils down to is that when you hand your cash over, you want to be reasonably sure that you're going to get what you paid for. If your mechanic says "sorry, I can't work on this type of car" that's a conflict that has been detected up front, but if she says "it might take me a few more days because I need to order a part" you can get back to your life in the meantime.

In the real world we often use optimistic concurrency to reduce the overhead for the most likely outcome.

Eventual Consistency In Software

The way the term "Eventual Consistency" is tossed around these days reminds me of this line from the Princess Bride.

An efficient garage runs under an eventual consistency model. We employ various tactics to reach consistency in the end if there are surprises, but generally there won't be any. This works well because unlike software, humans are very adaptable to failure.

The key to implementing an eventually consistent system is that when you make a promise to do something (e.g. the cash handover) you should be damn sure you can finish the job, even if there are a few complications (e.g. ordering additional parts). Allowing for real failures is not a good option, because breaking a promise is much more complicated than simply not making one in the first place.

Note that you can fix inconsistencies after the fact (this is called a compensating action), but predicting the failure modes, testing for them, and fitting auditing for these repairs into the schema is usually a lot more complicated than making sure you wouldn't need them in the first place.

In traditional multithreaded programming you need to use mutexes to protect a critical section. You could use a single mutex to protect a large sequence (a combination of several steps), or you could use finer grained locking to protect each sub step separately. You still need locking to avoid race conditions for each critical sub-step.

For instance, suppose you're writing an OS kernel and you need to allocate a page of memory for a process. If your kernel is threaded you need to ensure that the updates to the free list and the allocated list have no race conditions (or you might corrupt the memory maps because of two simultaneous allocations). The easiest way to do that is to use a mutex for the duration of the update.

Next, consider the counters used by commands like top to report information about system or process memory. These are not critical resources. Nothing really depends on them providing the right value, so they could be momentarily inconsistent with the actual memory maps. Basically, there is no real benefit in keeping the critical memory map mutex until you've updated all the counters; doing so slows down the memory access for the program requesting the page, and for any other process needing to make an allocation.

The eventual consistency way to allocate memory would be for the kernel thread that allocates the memory page to simply ignore the counters. It would need to safely update the memory map and then make a note of the change in some queue. This way the mutex is kept for a short duration.

Later, a low priority thread would aggregate the various updates from the queue, and bring the counters up to date, making them consistent with the memory maps.

This does not mean that the update to the memory map can be done without a mutex. The eventual consistency fairy will not come and magically fix the memory corruption in the event of a race condition. You also still need to use a mutex to protect the counters and the queue.

BASE and ACID are not really opposites, despite the pun in the name. BASE is an alternative to using ACID for everything, but it doesn't completely replace ACID, it builds on it as a low level tool.

Eventual consistency is about avoiding database transactions that take unreasonably long or require unreasonable complexity (multi node synchronization) by splitting them up into smaller localized steps. This also allows you to return results to the client before all the steps have been finished.

This means that the client might send the next request before all the steps have been finished, and get a result that is out of date, but this inconsistency must be fixed eventually. If it isn't, that's a bug (eventual consistency still implies consistency; without it the consistency model is a weak one).

If the client tries to make a conflicting change, the steps must be carried out in the correct order, and on the correct shard (the one that is authoritative for the data being changed), so that conflicts can never happen in any transactional step except the first one. By partitioning the data wisely you can be sure that such changes can always be made on a single master, and thus you can avoid the need for distributed transactions.

ACID transactions are like scope of a mutex protected operation. You still need some ACID guarantees to implement BASE correctly, but BASE methodologies are careful to avoid unnecessary contention and complexity by keeping the transactions small and localized.

Compromising on consistency this way means that some database reads could temporarily return stale data. To properly deal with that, avoid using stale data as the basis for future writes, or make sure your infrastructure won't let you do that by refusing to commit your transactions. Either the read for an update, or both the read and the write for an update, needs to go through to the single authoritative copy of that data.

Eventual consistency is not simpler than ACID, at least not for the programmer. It lets you build a scalable architecture from simpler primitives. If databases could work out what parts of the transactions can lag then BASE could be automated, but they can't, so it's the responsibility of the implementors to draw the lines in the right places.

In the kernel example, the naive threading approach was replaced by a more complex event driven model (which is still multithreaded). Event oriented programming may require managing more components, but forcing an ill-suited paradigm to scale is a much harder problem to solve.

Lastly, keep preemptive error detection in mind. Your system needs to deal with failure and detecting it ahead of time is the easiest way to do that. In the real world we are much more adaptable and can assess the situation to fix these problems as they arise, but software is made of fail, so it should stay simple and predictable. If you need to implement complex compensating actions you might as well go with a truly multi master infrastructure.

CouchDB 0.9

A while ago I wrote that CouchDB doesn't fit my needs well, based on my experience with version 0.8. What I didn't know is that CouchDB 0.9 had been out for a while, and while this version fixed some of my concerns, it also took away the one feature I really liked: the beautifully lightweight transactions.

In version 0.8 whenever you did a bulk update/insert of several documents at the same time, commits would be atomic. This means that if any single document had a conflict, the whole operation would fail. In version 0.9 only the conflicting document fails.

What this means is that the atomicity guarantee can no longer be provided for sets of documents, if your documents are co-dependent you might end up with inconsistencies in your database. Transactions are scoped to a single document only.

While some of my past criticism has been resolved and other parts still stand, I think this change is much more crucial when deciding whether or not to use CouchDB.

The rationale behind the removal is that it's too difficult to implement multi document transactions in a way that supports replication and transparent sharding, so to make sure that projects that start development on a single CouchDB server can scale to a cluster architecture, they need to deal with the transactional limitations early on.

Transparent Sharding

In my aforementioned email I asked why an explicit opt-in atomicity flag couldn't be introduced. At least in principle the implementation of such a flag is not very complicated. The problem is that in order for this to be useful, the sharding needs to be implemented such that the set of documents lives on a single node.

One of the design goals of CouchDB is that clustering should embed a minimal amount of knowledge about how the cluster is administered into the client. In short, the application code is supposed to be able to stay oblivious to the sharding, and therefore scale transparently if necessary.

The explicit removal of this feature implies that the "correct" sharding model is that of a distributed hash table. Even with consistent hashing the selection of the node responsible for storing a set of documents is supposed to be opaque.

In light of this change, to properly implement eventual consistency on top of CouchDB you need to update your documents in the correct order. If you update a document and get a conflict, all related updates should be aborted. However, if this document succeeds you need to be sure that all related updates can be carried out successfully. More specifically, any related update that fails due to a conflict must be simple enough to retry with the updated version.

The ACID guarantees you get from CouchDB only extend to a single document, so that is the coarsest granularity you can use. To implement a BASE like change workflow with CouchDB you need to make sure that every data update operation that affects multiple documents will only conflict in the first document that you update, or be prepared to implement compensating actions to resolve conflicting versions of documents after the fact.

Secondly, once you've issued that first crucial update, you also need to be sure that all the related updates are still run. If you already returned a result to the client and these actions run in batch, this can get tricky. BASE requires reliable message handling as well, using a transactional message queue.

That last bit is what troubles me most about CouchDB. Designing a data model whereby updates must be queued transactionally but transactions are limited to one document is rather tricky.

Reducing Contention

The CouchDB savvy way to implement deferred updates is actually to avoid issuing updates at all. The view subsystem lets you aggregate data and do simple inter-document data munging. CouchDB ensures that views will be synchronized to the state of the database, and the purely functional approach prevents conflicts. Hopefully the view system will be generalized to support multiple levels of mapping and views derived from other views.

I've already written about the benefits of append only models for scaling (and see also first and third posts on that topic).

If most data operations involve the creation of new documents (a process that shouldn't fail) and simple updates after those have succeeded, then you can make better use of the ACID guarantees CouchDB provides. Unsurprisingly, this is precisely the approach the book recommends.

If views are not enough for you, you can obviously use some sort of message queue, but the act of dequeuing a message and applying its effect should be protected by a single transaction. If a deferred update is not executed or executed more than once, the data will end up inconsistent (unless the update is idempotent)[2].

If you reduce all contention down to a single document you can effectively use CouchDB without sacrificing the consistency of your data.

Conclusions

I still don't use CouchDB for work. My data models tend to be fairly interlinked, so working around the lack of multi document transactions outweighs the potential benefits of using CouchDB. It's just not a good fit for writing KiokuDB code that works well with other backends. That doesn't mean you shouldn't use CouchDB, but it does mean that if your data is heavily interlinked and you introduce concurrency, your data might get corrupted.

If you look past the hype, CouchDB doesn't relieve you of thinking about how to design your schema. Of course, there is nothing wrong with that. Like any tool, you still need to know what it's doing to use it correctly.

Secondly, back in Perl land I write all my transactional code in such a way that failure is an option. Either transactions can fail to commit, or you need to use locks, or some data will eventually get overwritten. Pick any one, and choose or configure your database accordingly.

Lastly, I don't expect scalability to magically happen on its own. There is no secret formula that lets you wish the difficulties away. With good habits you can make the transition happen more easily, but I think it's still vital to understand how everything works under the hood, so that you know if its viable, and so that you can overcome the limitations successfully.

Framing this in terms of Werner Vogels' often linked post, you need to design a schema that will work with the consistency guarantees that your infrastructure provides (e.g. causal consistency, monotonic write consistency, etc.)[3].

See Also

Distributed systems are a rich topic. What it always boils down to is achieving an approximation of the global ordering of events. Distributed systems exist because we simply can't scale to handle everything serially, in a single point of failure. If the global ordering is total then replication adds redundancy, but reduces scalability.

The compromise always revolves around finding a partial ordering for your operations that still produces consistent data, so that even if there are variations in the order of events as observed by the distributed parts, the operations that do depend on each other are always handled in sequence and the result is the same everywhere.

The partial order can be negotiated by passing explicit or implicit synchronization tokens to the distributed components, waiting until some consensus is reached for global transactions, or completely localizing authority over a resource (as in BASE), so that the order of events that affects a piece of data is determined by the node responsible for that resource.

[1] Under the default isolation level, if you issue a SELECT statement and use the results to create a separate UPDATE statement in the same transaction, MySQL will simply overwrite any updates made by other transactions in the meantime. The SELECT and the UPDATE are not a single atomic operation to outside observers. Of course, this isn't really a problem since everyone uses SELECT FOR UPDATE, right?
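
To illustrate with DBI (the table, columns and connection details here are hypothetical):

use DBI;

my $dbh = DBI->connect( "dbi:mysql:app", "user", "secret",
    { RaiseError => 1, AutoCommit => 0 } );

# read-modify-write without locking: an update made to the same row by a
# concurrent transaction can be silently overwritten when this one commits
my ($qty) = $dbh->selectrow_array("SELECT quantity FROM items WHERE id = 42");
$dbh->do( "UPDATE items SET quantity = ? WHERE id = 42", undef, $qty - 1 );
$dbh->commit;

# the locking variant: the row stays locked until commit, so the SELECT and
# the UPDATE behave as one atomic operation to outside observers
my ($locked_qty) = $dbh->selectrow_array(
    "SELECT quantity FROM items WHERE id = 42 FOR UPDATE");
$dbh->do( "UPDATE items SET quantity = ? WHERE id = 42", undef, $locked_qty - 1 );
$dbh->commit;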

[2] Safe updates can be done by creating a new document for every update operation, essentially treating CouchDB as a message queue, and keeping track of which updates have been applied in the document they apply to. If you don't need auditing, handled updates can then be deleted, and the list of applied update IDs can be garbage collected. These operations are possible to implement with single document atomicity guarantees.

[3] Ironically I can't actually find any documentation about the guarantees that SimpleDB makes. I've heard vector clocks mentioned; they can be used to detect conflicts post factum, but they can't fix them, so I'm not sure how a writer would be alerted of conflicts found during replication. Anybody got a clue?

Saturday, August 15, 2009

In Defense of APIs

This is a response to Laurent Dami's post about when to prefer direct access to the hash fields of an object over an encapsulated API.

The point of the article is that encapsulation is a tradeoff, especially in Perl where you could do some things more easily or efficiently if you decide to give it up.

While I agree that this tradeoff exists, I think that it's not the only tradeoff you're allowed to make, and that Laurent's reasoning in defense of hash access can be solved in ways that don't involve breaking encapsulation and that are in my opinion superior.

In short, I find that most of his examples are either better solved by Moose, or fall short of being substantial.

I will attempt to refute his arguments one by one.

Good Perl idioms are no longer available

Admittedly the fact that the results of object methods or subroutine application cannot be easily interpolated into strings is a big annoyance in Perl, but that's far from a good reason to avoid them.

I might be alone on this, but personally I find the rest of these examples quite troubling. I think they miss the point of OO programming.

By refactoring these examples into a real API we no longer need the comments to explain what they're actually doing. Maybe the code implementing these methods is not as idiomatic, but we don't need to know it's there (and since we don't know, it could be that it actually is as idiomatic as he likes under the hood).

The total lines of code may be longer for a single usage, but most applications that manipulate points in a 2D space would need to do these operations many times, reducing the overall verbosity:

my $opposite = $point->symmetry_transform;

my $zoomed = $point->zoom($factor);

# I'd actually use a wrapper object for this
# (DisplayPoint with a Menu and a Point delegate)
my $point = Point->new_with_traits(
    traits => [qw(Menu)],
    menu   => { color => "green" },
    ...
);

There is one final example:

# temporary push aside
{ local $point->{x} += $far_away;
    do_something_without_that_point_in_the_way();
} # point automatically comes back to its previous location

And though for this one there is no direct correspondence[1], the problem is that instead of passing the set of relevant points to do_something_without_that_point_in_the_way(), the point is mutated to temporarily hide its effects.

I've posted on immutability before and why I think that this approach is wrong, but suffice it to say that I would write the above using an explicit collection of points, and make the operation of creating new collections easy instead of monkey patching the data.

No obvious distinction between "setter" methods and other methods

This problem can be solved by using an alternate method naming style. For instance MooseX::Policy::SemiAffordanceAccessor makes the default setter for an attribute named foo into set_foo.

Personally I feel this is overkill. If the relationship between the object and the method name is obviously that of a noun and an attribute, such as $thing->size(10), that is far from ambiguous.

For cases that are more involved, like a field manipulation that requires additional actions I tend to have a method name that has a verb in it and keep the setter completely private. Since the method would be doing something more than just set_ that is reflected in the API.

Hard to debug

This is a deficiency in the Perl debugger. I strongly believe that the maintainability of your code shouldn't suffer because of inferior tools. Someone might fix the Perl debugger one day. Though unlikely, it's much more unlikely that they will also fix all of my code.

Since the inner workings of Catalyst are obviously not what Laurent wants to debug in the example, the problem lies in the Perl debugger's inability to handle complex (and idiomatic) perl expressions.

Furthermore, in this particular case it shouldn't be a problem, $self is known and you could simply set a breakpoint for that class's min_body method and continue execution instead of stepping through every operation.

Secondly, if Catalyst supported lazy body parsing, that line might be run before the body length is actually known (the body may not have been fully read), in which case this operation wouldn't be a simple accessor fetch but rather a blocking read and an update of the request object.

This polymorphism in the API is precisely what makes it possible to enable such optimizations in future versions of Catalyst without breaking existing applications. If we simply used length($c->{request}{body}) directly, there would be no room for future improvements, even on an HTTP abstraction that could support that easily.

Non-scalar attributes must either copy or reinvent an API

This problem in particular is solved using MooseX::AttributeHelpers (note that this will be in core Moose soon, after receiving a big facelift).

In the more general sense, Moose deeply supports the encapsulation and packaging of patterns using its meta trait system. By applying traits to attributes we can alter their behavior, providing additional functionality in a concise and declarative way.

There's no need to reinvent or copy because of Moose's extensibility, something that is missing from other class systems, and this applies not only to non scalar data, but to many other patterns for which MooseX modules were written.

No generic data traversal modules

Again, introspection is key here. KiokuDB would never be possible without generic data traversal for Moose objects. That's what sets it apart from Pixie.

I've been meaning to write a Data::Visitor::Moose for a long while but simply never got around to it. It's definitely possible, and even not that hard.

Here is a simple example demonstrating JSON integration:

# "method" here is sugar from something like MooseX::Declare or
# MooseX::Method::Signatures; with plain Moose it would be
# sub TO_JSON { my $self = shift; ... }
method TO_JSON {
    return {
        map { $_->name => $_->get_value($self) }
            $self->meta->get_all_attributes
    };
}

But the really nice thing is that since all the attributes have extensible metadata we can be very particular about how we filter them. Unfortunately the JSON module's API doesn't allow us to specify parameters to the TO_JSON method, but if it would then we could very easily output different variants for different contexts.

From the point of view of the class being serialized (obviously the TO_JSON method could go in a role and get reused), we could decorate attributes with traits to determine their correct behavior. For instance an attribute that is used both by an ajax frontend displaying the data, and a REST Json feed for this object could be declared as:

has name => (
    traits => [qw(MyApp::View MyApp::REST)],
    ...
);

The actual implementation involves two roles (MyApp::View and MyApp::REST) with no implementation, and a simple $attr->does($role) call in the TO_JSON method.
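
A hedged sketch of how that might look (MyApp::Thing and as_rest_data are made-up names; the grep over $attr->does is the filtering described above):

package MyApp::REST;
use Moose::Role;
# deliberately empty: it only exists so that attributes can be tagged with it

package MyApp::Thing;
use Moose;

# attributes opt in to the feed by listing the role in their traits
has name   => ( is => "ro", traits => [qw(MyApp::REST)] );
has secret => ( is => "ro" );    # untagged, so it stays out of the REST feed

sub as_rest_data {
    my $self = shift;
    return {
        map  { $_->name => $_->get_value($self) }
        grep { $_->does("MyApp::REST") }
            $self->meta->get_all_attributes
    };
}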

This is not only cleaner, but also more powerful than violating encapsulation. The difference is that the encapsulation provided by e.g. Class::Accessor or Object::Tiny is just not introspectable or extensible.

Methods are slower than direct access to attributes

I won't argue about this, but in my experience even with large (some would say gratuitous) amounts of method calls I have never been able to demonstrate that accessors or other small methods were an actual performance issue (that is, using a profiler on real code with real inputs).

My dying laptop can make about 1,000,000 method calls per second. My $20 a month virtual private hosting image does 3,000,000. The difference between 5 method calls and 500 in a single dynamic web request is still too small to actually matter, at least for the kind of apps that I write. 100x as many method calls for the same amount of code will not be 100x slower, but will probably be 10x more maintainable ;-)

That aside, the XS branch for Moose has been performing almost as fast as direct hash access in microbenchmarks, while still providing full type validation, support for lazy attributes, access control, etc. This is because XSUBs are faster than pure perl subroutines (they require no context scope, and a single opcode dispatch). And of course these accessors can be subclassed or swapped out for more involved implementations should the need arise.

The XS branch is unmerged and unfinished, but there are a number of other XS accessor implementations on the CPAN if that is actually a problem in your code.

Conclusion

Laurent concluded his post saying:

In Perl, fully encapsulated objects are sometimes the best solution, sometimes not; weight these considerations before taking strong design decisions.

An interesting design is the one of DBI : objects (handles) are totally encapsulated, yet they exploit the power of tie to expose their attributes through a conventional hashref API, instead of OO getter and setter methods. This is a very clever compromise.

As far as I am concerned, I purposedly designed DBIx::DataModel to fully exploit the dual nature of Perl objects, having both the OO API for executing methods, and the hashref API for accessing column values. I wouldn't necessarily do that everywhere, but for row objects in a database, which are very open in nature, this just seemed an appropriate solution.

Specifically in the case of DBIx::DataModel, I think that hash access is a valid approach. But this is because DB records are not objects, but tuples. You can model tuples as objects, but not the other way around.

I couldn't agree more with his sentiment that this is a tradeoff you should consider, but I think his cutoff point is very different from mine. I would need to be much more desperate to reach for "naughty" hash access when accessors would do.

Lastly, every time I've used the DBI API I really wished it didn't do that horrible hashref hack. Cleverness is bad, it's confusing to newbies, it's a red flag to experienced programmers new to this API, and to reference his previous point about speed, tie is about 4 times as slow as normal method calls. To me that feature seems like a half hearted attempt to allow polymorphism and encapsulation in an API where methods would break compatibility.

This complexity makes understanding DBI internals a lot harder, and by proxy makes writing DBDs harder too. This is a fact of life that we deal with because DBI has been around for longer than I've known to program and because it's still robust and performant. It has flaws but it makes up for them. That doesn't mean the tied hash attributes were a good idea.

I feel that my decision to be an OO purist, especially in a language where impurities are so tempting, has made maintenance programming much easier for me and my coworkers.

To me polymorphism is a far more important idiom than concise hash manipulation. It may not be unique to Perl, but it's damned useful.

[1] $point->localize( x => $new_value, body => sub { ... } )

Wednesday, August 12, 2009

reset { hack { shift { return $_[0] } } }

Lately my pet project has been implementing delimited continuations for Perl. In an effort to keep myself motivated (it's hard and frustrating), I'll try and explain what I'm actually trying to do.

Caveat lector, the repository is nasty. There is no coherent process in the versioned history, it's just git commit -a at the end of the day, so the repository will be breaking. Secondly, I don't really know what I'm doing. I do in principle, but the details are still vague and I am discovering them as I go along.

So, WTF are delimited continuations? As Perl programmers the easiest way to understand them is probably Tom Moertel's favourite explanation, by Oleg Kiselyov.

As core hackers, the easiest way to understand them is to see how they are implemented. This is what I will stick with.

But really, the best way to understand continuations is to use a language that actually has them, and play around a bit. It's a strange concept to get used to. Unlambda, the INTERCAL of functional programming, uses continuations precisely because they can be difficult to understand (though technically they aren't delimited).

Continuations for Core Hackers

Delimited continuations are structured around two operations, reset and shift.

reset is the delimiter, it saves a position in the call chain for a future call to shift.

shift is what actually creates the continuation. It does that by walking up the call chain from the current position, up to the reset, and saving everything in between into the continuation. The continuation itself is reified as a function, and is passed as an argument to the block given to shift. Hold that thought, we'll get back to that in a second.

This interaction between shift and reset is much like the one between die and eval. eval delimits the stack, and die is what's called an escape continuation:

eval {
    warn "reached";
    die $value;
    warn "not reached";
};

return $@;

The die causes the code in the eval to be aborted, and sets the special variable $@ to $value. With shift and reset this could be expressed a little more elegantly as:

return cont_reset {
    warn "reached";
    cont_shift { return $value };
    warn "not reached";
};

What happens in shift is similar to die. The stack is unwound all the way up to the beginning of reset. The difference is that the return value of the shift block becomes the return value of the reset block.

Scope::Upper makes this sort of abstraction possible already, as demonstrated by Continuation::Escape.

Capturing a Continuation

Semantically eval/die and delimited continuations are actually very different. shift doesn't unwind the stack, but stashes the frames into a data structure.

Delimited.xs introduces two structs, delim_t and cont_t.

delim_t is very trivial, it contains a snapshot of the various state variables in the Perl interpreter, like the stack pointers, the current opcode, etc. When we call cont_reset a new delimiter is pushed onto a linked list.

Inside cont_shift the init_cont function will create another delimiter, and then destructively move all the stack frames between the two delimiters into a cont_t. When it's done, the current state of the interpreter has effectively been rewound to that of the start delimiter, but unlike die none of the data has been cleaned up; those stack frames are still "live", just not visible from the current state of the interpreter.

Reifying a Continuation

The restore_cont function takes a cont_t and appends copies of all the captured stack frames inside it to the current state of the interpreter. There are many things to fix up, and that is where the bulk of the implementation lies.

At the end of the restore_cont the next step for the interpreter is to resume execution right after the call to cont_shift, the end of the continuation capture.

The difference is when those stack frames are evaluated and we reach the end of the cont_reset block, instead of returning to the caller of cont_reset we return to the caller of restore_cont. To do this we overwrite the top level PERL_CONTEXT stack frame to use the current PL_op as the retop, instead of the one it was originally created with.

So how does restore_cont get called? We wrap the cont_t in a closure, that gets invoked as a function. Here is an example of no-op continuation usage:

my $value = cont_reset {
    return 3 + cont_shift {
        my $k = shift;
        $k->(7);
    };
};

is( $value, 10 );

This is a no-op because cont_shift unwinds the stack, gets the continuation in $k, and then immediately reinvokes it, restoring everything and causing cont_shift to return a value of 7 from cont_reset. This code is functionally equivalent to:

my $value = cont_reset {
    return 3 + 7;
};

Which is really the same as

my $value = sub { return 3 + 7 }->();

But $k is more than that: it's literally a function that appends a bunch of things to the interpreter's stacks. You can call it multiple times:

my $add_three = cont_reset {
    return 3 + cont_shift {
        my $k = shift;
        return $k;
    }
};

In this example cont_shift returns $k instead of invoking it. Since the stack has been unwound, this causes cont_reset to return $k.

So what happens when you invoke $add_three from the outside? The stack frames from reset until shift are appended, so the interpreter state is waiting for a value from shift to add 3 to.

This means that you can call $add_three->(7) and get back a value of 10, any number of times. In fact, this is what the basic sanity test does.
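
In test form (using Test::More's is, as in the earlier snippet), that check might look something like:

is( $add_three->(7),  10, "first invocation" );
is( $add_three->(7),  10, "and again" );
is( $add_three->(40), 43, "different argument, same captured frames" );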

Gory Details

Unfortunately what goes on behind the scenes isn't that simple (almost nothing is when it comes to Perl internals).

delim_t is initialized by init_delim to capture relative offsets for PL_stack_sp, PL_markstack_ptr, and the values of cxstack_ix, PL_scopestack_ix, PL_savestack_ix and PL_tmps_ix (which are relative offsets anyway). This information captures all of the stack states.

In addition it keeps track of a number of variables (like the current opcode, PL_op, the current pad, PL_comppad, and various other variables corresponding to the current state).

When we init_cont the start and end delim_ts are used to capture state in the cont_t using several functions.

init_cont_stack allocates buffers for the mark stack and value stack. All the stacked SVs are put into an AV, increasing their reference counts, and the marks are converted to 0 based offsets, based on the value of PL_stack_sp at the start delimiter.

init_cont_cxs captures Perl's context stack. This stack contains PERL_CONTEXT structures which contain all the information about the call chain as well as loops and other blocks requiring data.

In Perl the actual storage for lexical variables is not allocated on the context stack (though it probably should be), but is instead appended to CvPADLIST(cv), which is an array of arrays stored in the CV's data directly (this allows storage space to be cleared instead of freed, so repeated calls to the same subroutine are slightly faster).

The context stack is walked from the end to the beginning, and any CXt_SUB, which denotes a subroutine call, causes the lexical pad instance of that subroutine to be popped off of CvPADLIST(cv) and stored in a pads AV in the continuation. CvDEPTH(cv) is adjusted appropriately.

The contexts also contain values like cx->blk_oldsp, previous offsets of the various stack pointers, and these, like the mark stack, are converted to be relative to the start delimiter.

init_cont_saves handles the save stack, and the associated scopestack (which is like the markstack of the savestack). The savestack is used to roll back the interpreter state when leaving a scope.

Any operation that would need to be undone is handled using the savestack, and dispatched using the ENTER and LEAVE macros.

our $foo = 1;
{ # calls ENTER
    local $foo = 3;
} # calls LEAVE

ENTER pushes PL_savestack_ix onto the scope stack. This marks the offset of the savestack.

local $foo pushes a save entry onto the savestack with the previous value of $foo, which is the IV 1.

When the scope that called local is exited the LEAVE macro pops an entry from the scope stack, and then calls leave_scope (from scope.c) to dispatch all the SAVEt_* entries between PL_savestack_ix and the value of PL_savestack_ix popped off the scope stack, causing $foo to be set back to 1.

There are many types of save entries which are crucial to properly managing Perl's data structures and runtime semantics. For instance SAVEDESTRUCTOR_X(function, pointer) can be used to automatically call function on pointer when the current scope is left, and is very useful for managing data from XS modules.

To properly handle the semantics of this stack unwinding we have to partition the savestack into entries that should be recreated every time restore_cont is called (and therefore called zero or more times), and those which should be recreated when the continuation is garbage collected (and therefore only called once).

The entries that should be recreated every time are those that keep track of the currently active lexical pad (SAVEt_COMPPAD), those that keep track of stack pointers (SAVEt_STACK_POS and SAVEt_STACK_CXPOS), and those that clear lexical variables (SAVEt_CLEARSV, pushed to the savestack by pad_sv opcodes when assigning the initial value to a lexical variable).

Localization entries will be handled in the future, but are currently deferred.

Lastly, init_cont_state resets the state of the interpreter to the start delimiter and detaches the two delimiters from the linked list.

restore_cont

Now that we have a detached continuation in a cont_t, appending its contents to the running interpreter is what remains.

The process of invoking a continuation is pretty much the inverse of the init_cont functions. All 0 based offsets are added to the current values of the stack counters, the cont->end delim_t is used to set the values of the interpreter variables, etc.

The main things done in addition are the cloning of pads (and other state), and fixups at the top level context entry.

Pad cloning is done to ensure that each recreated instance of the continuation has its own copies of the lexical variables it uses. Without this the repeated invocations of the continuations will get in each others' way. Unfortunately this is pretty tricky, we have to walk the context stack top down and clone various values, keeping the addresses in a pointer table so that everything points back at the right thing. This is the most broken bit of code right now.

Similarly, SAVEt_COMPPAD entries have to be mapped as well, in order to reset the interpreter to the cloned pads, not the pads captured in cont_t.

Though not implemented, faked @_ arrays would need to be recreated with the duplicated value stack, and things like loop counters for foreach blocks also need more intricate copying.

Finally, the top level PERL_CONTEXT that was copied has to have its retop set to PL_op->op_next, where PL_op is the one making the call to the reified continuation. This causes the cont_reset block to return to where the caller invoked the continuation, properly connecting the call chain.

TODO

Tests are failing. Shit is segfaulting. Sanity is fading. But it works, sort of. This is a proof of concept, and now I have to finish off the various details to make it into a usable module.

If you'd like to help and have no core-fu, feel free to write failing tests. Anything you'd like to do with delimited continuations should be a test.

If you'd like to help and do have core-fu please review my horrible C code, point out obvious mistakes, or even implement some of the missing bits.

I've been previously told that continuations aren't possible in Perl 5, but I think this has been demonstrated to be false. Anyway, $title->().

Thursday, August 6, 2009

YAPC::EU::2009

So YAPC is pretty much over, and it was awesome. My last minute talks went somewhere between "fine" and "well" depending on the talk. Unlike most other YAPCs I actually caught some of the other sessions too.

I'm really happy with the feedback from the attendees, people are responding better than ever to Moose and even KiokuDB despite the fact that it's such a young project.

This YAPC, more than any other, is reassuring me that the Perl community is moving in the right direction. We seem to be refocusing on what's really important (to me anyway). We are not all the way there, Perl still has problems with marketing, technical debt, flawed perceptions, bad practices and so on, but it's obvious that the trend is one of moving towards where we want to be, not some inevitable doom.