Wednesday, July 1, 2009

PL_runops

I'm going to try to do a series of posts about learning Perl internals. I am still a beginner: I have trouble remembering the many macros and keeping everything in my head all at once, so hopefully I will be able to make some sense of this stuff in a way that is accessible to other beginners.

While I'm definitely diving right into the deep end, I think PL_runops is as good a place to start as any; there's not a lot you need to learn to see how it works. I don't think Perl has a shallow end.

PL_runops is a variable in the interpreter containing a function pointer, which in most cases will be Perl_runops_standard.

Perl_runops_standard is the function that executes opcodes in a loop. Here's how it's defined:

int
Perl_runops_standard(pTHX)
{
    dVAR;
    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
        PERL_ASYNC_CHECK();
    }

    TAINT_NOT;
    return 0;
}

Perl makes extensive use of macros, which can sometimes be confusing, but in this instance it's not too daunting. This loop will essentially keep calling PL_op->op_ppaddr, assigning the result to PL_op. As long as a valid value is returned it will keep executing code.
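To make the shape of that loop concrete, here is a minimal sketch in plain C that does not use any Perl headers. Everything in it (toy_op, toy_PL_op, toy_pp_add, toy_runops) is a made-up stand-in rather than Perl's real API; the only thing it tries to mirror is the "call the op's function, assign the result, repeat until NULL" pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy op: NOT Perl's real OP struct, just the two fields
 * the loop cares about plus a payload for the demo. */
typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*op_ppaddr)(void);  /* function implementing this op */
    toy_op *op_next;             /* op to run afterwards, or NULL */
    int     payload;
};

toy_op *toy_PL_op;   /* stands in for PL_op */
int     toy_total;   /* side effect so execution can be observed */

/* A "pp" function: does its work, then returns the next op to run. */
toy_op *toy_pp_add(void)
{
    toy_total += toy_PL_op->payload;
    return toy_PL_op->op_next;
}

/* Same shape as Perl_runops_standard: call, assign, repeat until NULL. */
int toy_runops(toy_op *start)
{
    assert(start != NULL);
    toy_PL_op = start;
    while ((toy_PL_op = toy_PL_op->op_ppaddr())) {
        /* the real loop runs PERL_ASYNC_CHECK() here */
    }
    return 0;
}

/* Builds a three-op chain adding 1, 2 and 3, runs it, returns the total. */
int toy_run_chain(void)
{
    toy_op c = { toy_pp_add, NULL, 3 };
    toy_op b = { toy_pp_add, &c,   2 };
    toy_op a = { toy_pp_add, &b,   1 };

    toy_total = 0;
    toy_runops(&a);
    return toy_total;
}
```

The loop itself knows nothing about what the ops do. All the control flow lives in the pointer each op function chooses to return.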

Just to get it out of the way, PERL_ASYNC_CHECK is a macro that checks to see if any signals were delivered to the process, and invokes the handlers in %SIG if necessary.

So what's PL_op? It's essentially Perl's instruction pointer: it refers to the opcode currently being executed. When Perl compiles source code it produces an optree. This is probably one of the more complicated structures in Perl internals, but right now we're only concerned with one small part, the op_ppaddr field of a single node in the tree. The op_ppaddr field contains a pointer to the function that implements the op.

PP stands for push/pop, which means that it's a function that operates in the context of the Perl stack, pushing and popping items as necessary to do its work, and returns the next opcode to execute.

pp.h defines a PP macro, which sets up a signature for a function that returns an op pointer. Let's have a look at two simple PP functions. First is pp_const:

PP(pp_const)
{
    dVAR;
    dSP;
    XPUSHs(cSVOP_sv);
    RETURN;
}

This is an implementation of the const op, which pushes the value of a literal constant to the stack. The cSVOP_sv macro is used to get the actual SV (scalar value structure) of the constant's value from the optree. SVOP is an OP structure that contains an SV value. The c stands for "current".

Let's rewrite the body of the macro using some temporary values:

/* the current opcode is op_const, so the op structure is an SVOP
 * Instead of using the long chain of macros we'll put it in a variable */
SVOP *svop = (SVOP *)PL_op; 

/* and then we can simply use the op_sv field, which
 * contains a pointer to a scalar value structure */
SV *sv = svop->op_sv;

This SV is then pushed onto the stack using the XPUSHs macro. The X in XPUSHs means that the stack will be extended if necessary, and the s denotes SV. To read the documentation of these macros, refer to perlapi.
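The "extend if necessary, then push" idea behind the X can be sketched with a toy growable stack. This is only an illustration under simplifying assumptions: it stores ints instead of SV pointers, the gstack type and its functions are invented for the example, and the doubling strategy is my choice, not a claim about how Perl sizes its stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable stack; sp points at the top item and
 * slot 0 is never used to hold a value. */
typedef struct {
    int   *base;   /* start of the allocation */
    int   *sp;     /* top item (== base when empty) */
    size_t cap;    /* number of slots allocated */
} gstack;

int gstack_init(gstack *s, size_t cap)
{
    assert(cap > 0);
    s->base = malloc(cap * sizeof *s->base);
    if (!s->base)
        return -1;
    s->sp  = s->base;
    s->cap = cap;
    return 0;
}

/* Loosely like XPUSHs: make sure there is room, then push. */
int gstack_xpush(gstack *s, int v)
{
    ptrdiff_t top = s->sp - s->base;        /* index of the top item */

    if ((size_t)top + 1 == s->cap) {        /* no free slot left */
        size_t ncap  = s->cap * 2;
        int   *nbase = realloc(s->base, ncap * sizeof *nbase);
        if (!nbase)
            return -1;
        s->base = nbase;
        s->sp   = nbase + top;              /* re-point into the new block */
        s->cap  = ncap;
    }
    *++s->sp = v;
    return 0;
}
```

Note that the stack pointer has to be recomputed after growing, because the storage may have moved. A plain push that skips the capacity check is cheaper, but is only safe when the caller already knows there is room.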

The next thing that executes is the RETURN macro. This macro is defined in pp.h, along with a few others:

#define RETURN          return (PUTBACK, NORMAL)

#define PUTBACK         PL_stack_sp = sp

/* normal means no special control flow */
#define NORMAL          PL_op->op_next

/* we'll use this one later: */
#define RETURNOP(o)     return (PUTBACK, o)

Recall that there are no lists in C. The comma operator evaluates its left operand, discards the result, then evaluates its right operand and yields that value (unfortunately we have that in Perl, too). The RETURN macro therefore desugars to something like:

PL_stack_sp = sp;
return(PL_op->op_next);
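The comma-operator trick can be tried in isolation. In this small sketch demo_stack_sp, DEMO_RETURNOP and demo_comma are hypothetical names invented for the example, and plain ints stand in for the pointers Perl uses.

```c
#include <assert.h>

int demo_stack_sp;   /* stands in for PL_stack_sp */

/* Same shape as RETURNOP: the left operand runs for its side effect,
 * and the whole expression takes the value of the right operand. */
#define DEMO_RETURNOP(o) return (demo_stack_sp = sp, (o))

/* Stores sp into the global and returns next, in a single statement. */
int demo_comma(int sp, int next)
{
    DEMO_RETURNOP(next);
}
```

Calling demo_comma(7, 42) returns 42 and leaves 7 in demo_stack_sp. Packing both steps into one return expression is what lets the macro be used anywhere a bare return statement could appear, such as an unbraced if branch.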

The XPUSHs macro manipulated sp, the local copy of the stack pointer, as it was adding our SV. This change is not immediately written to the actual stack pointer, PL_stack_sp. The PUTBACK macro sets the "real" stack pointer to the version we've manipulated in the body of our opcode.
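The local-copy-then-publish pattern is easy to get wrong, so here is a toy model of it. The names (mini_stack, mini_stack_sp, mini_push_two, mini_stack_depth) are hypothetical and the stack holds ints rather than SV pointers; the putback flag exists only so the demo can show what happens when the publish step is skipped.

```c
#include <assert.h>

#define MINI_STACK_MAX 8

int  mini_stack[MINI_STACK_MAX];
int *mini_stack_sp = mini_stack;   /* global pointer, plays PL_stack_sp;
                                    * points at the top item */

/* Pushes two values through a local copy of the stack pointer.
 * If putback is zero the global pointer is never updated, so code
 * that looks at mini_stack_sp will not see the new items. */
void mini_push_two(int a, int b, int putback)
{
    int *sp = mini_stack_sp;                      /* like dSP */

    assert(sp + 2 < mini_stack + MINI_STACK_MAX); /* room for two more */
    *++sp = a;                                    /* only the local moves */
    *++sp = b;
    if (putback)
        mini_stack_sp = sp;                       /* like PUTBACK */
}

/* Number of items visible through the global pointer. */
int mini_stack_depth(void)
{
    return (int)(mini_stack_sp - mini_stack);
}
```

With putback set to 0 the values are written into the array, but mini_stack_depth() still reports 0; with putback set to 1 it reports 2.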

Then the opcode simply returns PL_op->op_next. The op_next field in the op contains a pointer to the next op that should be executed. In this case if the code being executed was:

my $x = 42;

then the const op compiled to handle the 42 literal would have pushed an SV containing the integer 42 onto the stack, and the op_next in this case is the assignment operator, which will actually use the value.

So, to recap, when PL_op contains a pointer to this const op, PL_op->op_ppaddr will contain a pointer to pp_const. PL_runops will call that function, which in turn will push ((SVOP *)PL_op)->op_sv onto the stack, update PL_stack_sp, and return PL_op->op_next.

At this point runops_standard will assign that value to PL_op, and then invoke the op_ppaddr of the next opcode (the assignment op).

So far so good?

To spice things up a bit, here's the implementation of logical or (||):

PP(pp_or)
{
    dVAR; dSP;
    if (SvTRUE(TOPs))
        RETURN;
    else {
        if (PL_op->op_type == OP_OR)
            --SP;
        RETURNOP(cLOGOP->op_other);
    }
}

The or op is of a different type than the const op. Instead of SVOP it's a LOGOP, and it doesn't have an op_sv but instead it has an op_other which contains a pointer to a different branch in the optree.

When pp_or is executed it will look at the value at the top of the stack using the TOPs macro, and check if it evaluates to a true value using the SvTRUE macro.

If that's the case it short circuits to op_next using the RETURN macro, but if it's false it needs to evaluate its right argument.

When the op is a plain || (op_type is OP_OR), --SP is used to throw away the false left argument (so that $a || $b doesn't end up returning both $a and $b).

Then the RETURNOP macro is used to call PUTBACK, and to return PL_op's op_other. RETURN is essentially the same as RETURNOP(NORMAL). op_other contains a pointer to the op implementing the right branch of the ||, whereas op_next is the op that will use the value of the || expression.
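The branching between op_next and op_other can be modelled with a toy "optree" of three ops. This is a sketch under several assumptions: the lop struct and all function names are invented, ints stand in for SVs, each op receives itself as an argument instead of reading a global PL_op, and only the plain || case is modelled (so the left value is always discarded when false).

```c
#include <assert.h>
#include <stddef.h>

typedef struct lop lop;
struct lop {
    lop *(*ppaddr)(lop *self);
    lop *next;      /* like op_next */
    lop *other;     /* like op_other; only used by the "or" op */
    int  value;     /* only used by the "const" op */
};

int  lstack[8];
int *lsp = lstack;  /* points at the top item; slot 0 is unused */

lop *lpp_const(lop *self)
{
    *++lsp = self->value;   /* push the constant */
    return self->next;
}

lop *lpp_or(lop *self)
{
    if (*lsp)               /* like SvTRUE(TOPs) */
        return self->next;  /* left side is true: keep it, skip the right */
    --lsp;                  /* discard the false left value */
    return self->other;     /* go evaluate the right side */
}

/* Runs "left || right" and returns the single value left on the stack. */
int eval_or(int left, int right)
{
    lop rhs = { lpp_const, NULL, NULL, right };
    lop or_ = { lpp_or,    NULL, &rhs, 0 };
    lop lhs = { lpp_const, &or_, NULL, left };
    lop *op = &lhs;

    lsp = lstack;
    while ((op = op->ppaddr(op)))
        ;
    assert(lsp == lstack + 1);   /* exactly one result remains */
    return *lsp;
}
```

In this model eval_or(5, 9) never runs the right-hand const op at all, while eval_or(0, 9) pops the 0 and then pushes 9 in its place.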

This is one of the most basic parts of the Perl 5 virtual machine. It's a stack-based machine that roughly follows the threaded code model for its intermediate code structures.

The data types mostly revolve around the SV data structure, and moving pointers to SVs from op to op using the stack.

For me the hardest part to learn in Perl is definitely the rich set of macros, which are almost a language in their own right.

If you search for runops on the CPAN you will find a number of interesting modules that assign to PL_runops at compile time, overriding the way opcodes are dispatched.
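Because the dispatch loop is reached through a function pointer, replacing it is just an assignment. The sketch below models that idea in plain C with invented names (xop, x_runops, x_runops_counting and so on); real modules do this from XS against Perl's own types, and nothing here is taken from any particular CPAN module.

```c
#include <assert.h>
#include <stddef.h>

typedef struct xop xop;
struct xop {
    xop *(*ppaddr)(xop *self);
    xop *next;
};

typedef int (*runops_fn)(xop *start);

/* An op that does nothing except hand back its successor. */
xop *xpp_noop(xop *self) { return self->next; }

/* Default loop, same shape as the standard one. */
int x_runops_standard(xop *op)
{
    while ((op = op->ppaddr(op)))
        ;
    return 0;
}

int x_ops_seen;   /* counter maintained by the replacement loop */

/* Replacement loop: dispatches the same way but counts each op first. */
int x_runops_counting(xop *op)
{
    while (op) {
        ++x_ops_seen;
        op = op->ppaddr(op);
    }
    return 0;
}

runops_fn x_runops = x_runops_standard;   /* plays the role of PL_runops */

/* Installs the counting loop, runs a three-op chain, restores the default. */
int x_count_ops_in_chain(void)
{
    xop c = { xpp_noop, NULL };
    xop b = { xpp_noop, &c };
    xop a = { xpp_noop, &b };

    x_ops_seen = 0;
    x_runops = x_runops_counting;   /* the override */
    x_runops(&a);
    x_runops = x_runops_standard;   /* put the default back */
    return x_ops_seen;
}
```

The ops themselves are untouched by the swap, which is what makes this a convenient hook for things like tracing or profiling.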

For more information on Perl internals, the best place to start is perlguts, and the wonderful perlguts illustrated.

4 comments:

James Mastros (theorbtwo) said...

Why does pp_or only --SP if PL_op->op_type == OP_OR? Why would it get called at all if op_type isn't OP_OR?

nothingmuch said...

In an expression like $x ||= foo() the container for $x remains on the stack and the op_other branch will eventually lead to an assignment to $x.

Try reading through:

perl -MO=Concise -e '$x ||= 3'

(And thanks to Vincent Pit for explaining on IRC)

Anonymous said...

Think about linking to your documentation work from http://www.perlfoundation.org/perl5/. By the way, my self-imposed task is cleaning up http://perldesignpatterns.com/?PerlAssembly and merging it into http://www.perlfoundation.org/perl5/index.cgi?optree_guts . And chipdude's git-wiki thing should probably be linked to the perlfoundation wiki... oh, what a tangled web we weave when first we fork a wiki...

nothingmuch said...

I would probably be more inclined to make this stuff into smaller chunks of actual core docs in the long term.

I will try to link when there is something a little more coherent than a bulleted list as a possible context; this series has a linear progression in mind =)