Tuesday, July 28, 2009

SV *

In my last internals post I mentioned that most of the data in Perl revolves around the SV struct.

The excellent PerlGuts Illustrated describes these structures in great detail, and of course there are the two main documents in Perl's own reference: perlguts, which tells you about them, and perlapi, which tells you what you can do with them.

This post is sort of like PerlGuts for dummies, intended to lightly touch on the SV struct and then explain some typical usage but not much more.

Unfortunately this is going to be a hard read. In early revisions I tried to show how to write a very simple XS function, but there are just too many details, so I will save that for a later time.

What's in an SV

[Figure: diagram of an SV structure]

I've stolen this diagram of an SV from PerlGuts Illustrated, but this is only the tip of the iceberg. Click the link if you haven't already. Really. JFDI. I'll wait.

SV stands for scalar value. This C struct represents every singular value available in Perl: numbers, strings, references. The SV struct is technically what's called a tagged union: it contains a field that denotes the value's type, and the meaning of its 4 fields can change depending on that type.

The first field is called ANY. This contains a pointer to the actual data. We'll get back to this one in a second.

The second field is the reference count. This is just a simple integer. You can increment it by calling SvREFCNT_inc, and decrement it using SvREFCNT_dec. New SVs have a reference count of 1, and if you SvREFCNT_dec to 0 then the SV is destroyed.

The third and fourth fields are the flags and the type. Again, for details refer to illguts, but the gist of it is that the type and the flags tell the various macros what structure the ANY field points to.

The simplest value is an uninitialized SV. In this case the type is SVt_NULL and the ANY field is a null pointer.
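To make the four fields a bit more concrete, here is a toy model in plain C. It is only a sketch: every name in it (toy_sv, toy_init, toy_inc, toy_dec) is made up for illustration, and the real definitions in Perl's headers are laid out differently.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: every name here is hypothetical, not Perl's real definition. */
typedef enum { TOY_NULL, TOY_IV, TOY_PV } toy_type;

typedef struct toy_sv {
    void        *any;     /* pointer to the body; NULL while uninitialized */
    unsigned int refcnt;  /* how many owners hold this value */
    unsigned int flags;   /* which slots of the body are currently valid */
    toy_type     type;    /* says what kind of body 'any' points to */
} toy_sv;

/* Initialize a fresh value: no body, type NULL, exactly one owner. */
static void toy_init(toy_sv *sv) {
    sv->any = NULL;
    sv->refcnt = 1;
    sv->flags = 0;
    sv->type = TOY_NULL;
}

/* Take an extra reference. */
static void toy_inc(toy_sv *sv) {
    sv->refcnt++;
}

/* Drop a reference. Returns 1 when the count reached zero (the point at
 * which the value would be destroyed), 0 otherwise. */
static int toy_dec(toy_sv *sv) {
    assert(sv->refcnt > 0);
    return --sv->refcnt == 0;
}
```

A freshly initialized toy_sv looks like the uninitialized SV described above: a null body pointer, the NULL type, and a reference count of 1.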

So what about an SV containing a number?

In the Perl source code the type IV is an alias to the native integer type. It's pretty much the same as long or long long depending on the architecture and the arguments to Configure.

If you call sv_setiv(sv, 42) then the SV's ANY field will be set up such that it points at an IV whose value is 42, the SvTYPE will be SVt_IV, and the IOK flag will be set, saying that the IVX slot has a valid value (this is actually not entirely accurate, so again, see illguts for more details).

The ANY field is actually set by the SvUPGRADE macro. If necessary this macro will allocate the additional structures required for the target type, and change the SvTYPE. In this case it will ensure that a storage area for the IV is available.

When the IV is actually stored in that structure the IOK flag will be enabled, signifying that there is a valid integer stored in this SV.
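The upgrade-then-store sequence can be sketched with a toy model. This is not how sv_setiv or SvUPGRADE are implemented; toy_upgrade_to_iv, toy_setiv and TOY_IOK are hypothetical names that only mimic the order of events described above.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: hypothetical names, not Perl's real structures or API. */
typedef enum { TOY_NULL, TOY_IV } toy_type;
enum { TOY_IOK = 1 };  /* flag: the integer slot holds a valid value */

typedef struct { long iv; } toy_body_iv;

typedef struct {
    void        *any;
    unsigned int refcnt;
    unsigned int flags;
    toy_type     type;
} toy_sv;

/* Make sure the value has a body that can hold an integer.
 * Returns 0 on success, -1 if allocation failed. */
static int toy_upgrade_to_iv(toy_sv *sv) {
    assert(sv != NULL);
    if (sv->type == TOY_IV)
        return 0;                       /* already has the right body */
    toy_body_iv *body = calloc(1, sizeof *body);
    if (body == NULL)
        return -1;
    sv->any = body;                     /* the "ANY" pointer now has a target */
    sv->type = TOY_IV;
    return 0;
}

/* Store an integer: upgrade first, then fill the slot and mark it valid. */
static int toy_setiv(toy_sv *sv, long value) {
    if (toy_upgrade_to_iv(sv) != 0)
        return -1;
    ((toy_body_iv *)sv->any)->iv = value;
    sv->flags |= TOY_IOK;
    return 0;
}
```

Note that the flag is only raised after the slot has been written, which is the same ordering as in the text: storage first, then "this integer is valid".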

To extract an IV value from an SV you use the SvIV macro. This macro resolves to an expression that will return a native integer type from the structure pointed to by the ANY field.

So what happens when you call SvIV(sv) but sv actually contains a string? What SvIV actually does is check the IOK flag, which, when set, means that the ANY field points to a structure with a valid IV in it. If the flag is not set, then an IV will be generated (and stored) by numifying the value.

In the case of a string like "42" (whose type is SVt_PV), the value will be parsed using the grok_number function. The SVt_PV will then be upgraded to an SVt_PVIV, which has slots for a string as well as an integer.

The next time SvIV is called on this SV the IOK flag will be set from the previous invocation, so it will just return the IV without computing anything.
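Here is a toy sketch of that "convert once, then reuse" behaviour. The names (toy_pviv, toy_get_iv, the TOY_* flags) are hypothetical, strtol merely stands in for grok_number, and the conversions counter exists only so the caching is observable.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: hypothetical names, not Perl's real structures or API. */
enum { TOY_IOK = 1, TOY_POK = 2 };  /* integer slot valid / string slot valid */

typedef struct {
    const char *pv;           /* string slot */
    long        iv;           /* integer slot, filled in lazily */
    unsigned    flags;
    int         conversions;  /* how often the string was parsed (demo only) */
} toy_pviv;

/* Return the integer value, parsing the string only if no valid integer
 * has been cached yet. strtol is a stand-in for the real number parser. */
static long toy_get_iv(toy_pviv *sv) {
    assert(sv != NULL);
    if (!(sv->flags & TOY_IOK)) {
        sv->iv = (sv->flags & TOY_POK) ? strtol(sv->pv, NULL, 10) : 0;
        sv->flags |= TOY_IOK;   /* remember that the slot is now valid */
        sv->conversions++;
    }
    return sv->iv;
}
```

Calling toy_get_iv twice on a value holding the string "42" parses the string only on the first call; the second call sees the flag and returns the stored integer.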

So in effect, there is an SvTYPE for every kind of value, and also for every sensible combination of values (in our second example we make an SV that is both an integer and a string at the same time).

As is the case everywhere in Perl, there is a rich set of macros to manipulate the structures, so that you generally don't have to think about the various flags and types; they are an implementation detail. You tell the macros what you want to get out of an SV and they do the hard work for you.

If you haven't already, please take a few minutes to at least skim illguts. It's not important to know all the different variations and what they mean, but you should know what a reference looks like in C land, and how things like arrays are represented.

Stacks

Perl 5 is a stack based virtual machine. This means that in order to pass data around, pointers to SVs are pushed onto a stack by the caller, and taken off the stack by the code being called. The called code does its thing, and then overwrites its parameters with pointers to the SVs that are the result of the computation.

The two main stacks that pass data around are "the" stack and the mark stack. SVs are pushed onto the stack, while the mark stack contains pointers to interesting places on the stack.

In order to push a value, you use the XPUSHs macro or a variant. X stands for "extend": it makes sure the stack has enough room for the value before adding it.

So what's the mark stack for? When you call a subroutine, like foo($x, $y) a new mark is pushed, then $x, then $y. The body of foo gets a variable sized list of parameters by looking from TOPMARK to the head of the stack.

To get values you can use the ST macro. $x is available as ST(0) (though there are many convenience macros that are usually quicker than ST).
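The interplay between the two stacks is easier to see in a toy model. This is only a sketch with fixed-size arrays and made-up names (toy_stacks, toy_pushmark, toy_push, toy_items, toy_arg); Perl's real stacks grow on demand and are driven by macros.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: hypothetical names and fixed sizes, not Perl's real stacks. */
enum { TOY_STACK_MAX = 64, TOY_MARK_MAX = 16 };

typedef struct {
    const void *slots[TOY_STACK_MAX];  /* "the" stack: pointers to values */
    size_t      sp;                    /* number of slots in use */
    size_t      marks[TOY_MARK_MAX];   /* mark stack: saved stack positions */
    size_t      nmarks;
} toy_stacks;

/* Remember where the argument list of the next call starts. */
static void toy_pushmark(toy_stacks *s) {
    assert(s->nmarks < TOY_MARK_MAX);
    s->marks[s->nmarks++] = s->sp;
}

/* Push a pointer to a value. No copy is made, so the callee sees an alias. */
static void toy_push(toy_stacks *s, const void *value) {
    assert(s->sp < TOY_STACK_MAX);
    s->slots[s->sp++] = value;
}

/* Number of arguments pushed since the most recent mark. */
static size_t toy_items(const toy_stacks *s) {
    assert(s->nmarks > 0);
    return s->sp - s->marks[s->nmarks - 1];
}

/* Fetch the n-th argument of the current call, counting from the mark. */
static const void *toy_arg(const toy_stacks *s, size_t n) {
    assert(n < toy_items(s));
    return s->slots[s->marks[s->nmarks - 1] + n];
}
```

For a call like foo($x, $y) the caller would do toy_pushmark, then toy_push twice. The callee finds two items, and argument 0 is the very same pointer the caller pushed, no matter how much unrelated data sits below the mark.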

As a side note, recall that values pushed to the stack are passed as aliases. This is because a pointer to the SV is copied to the head of the stack (as opposed to a copy of the SV itself).

So what about reference counting? Pushing parameters is fine (there's still a counted reference in the call site), but when you're pushing a return value you've got a problem: if you SvREFCNT_dec the value then it will be destroyed prematurely, but if you don't then it will leak.

Fortunately Perl has yet another stack to solve this, the tmps stack. When you call sv_2mortal(sv) a pointer to sv is saved on the tmps stack, and on the next call to the FREETMPS macro all of these mortal SVs will have their reference count decremented.
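The deferred decrement can be sketched like this. Again a toy: toy_mortal and toy_free_tmps are hypothetical stand-ins that only mirror the idea of sv_2mortal and FREETMPS, and "destroyed" is just a flag here instead of freeing memory.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: hypothetical names, not Perl's real tmps stack. */
enum { TOY_TMPS_MAX = 32 };

typedef struct { unsigned refcnt; int destroyed; } toy_sv;

typedef struct {
    toy_sv *items[TOY_TMPS_MAX];  /* values waiting for a deferred decrement */
    size_t  count;
} toy_tmps;

/* Drop one reference; mark the value destroyed when none remain. */
static void toy_dec(toy_sv *sv) {
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0)
        sv->destroyed = 1;
}

/* Defer a decrement: the value stays alive until toy_free_tmps runs. */
static toy_sv *toy_mortal(toy_tmps *t, toy_sv *sv) {
    assert(t->count < TOY_TMPS_MAX);
    t->items[t->count++] = sv;
    return sv;
}

/* Apply every deferred decrement and empty the list. */
static void toy_free_tmps(toy_tmps *t) {
    while (t->count > 0)
        toy_dec(t->items[--t->count]);
}
```

A return value that nobody picked up is destroyed when the temporaries are freed, while one whose count was incremented by the caller in the meantime survives with a count of 1. That is exactly the window a return value needs.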

Assignment and copies

In order to assign the value of one SV to another, use the SvSetMagicSV macro to copy the data over. It doesn't copy the pointer in ANY, but the values themselves.

One important thing to keep in mind is what happens when you copy a reference, like:

my $x = \%hash;
my $y = $x;

$y is actually a copy: it is a separate SV from $x. The two SVs both point at the same HV (%hash).
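In toy form, with hypothetical names (toy_hv, toy_ref, toy_newref, toy_copyref) and nothing like Perl's real layout, the situation looks roughly like this:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: hypothetical names, not Perl's real structures. */
typedef struct { unsigned refcnt; } toy_hv;   /* stands in for %hash */

typedef struct { toy_hv *target; } toy_ref;   /* stands in for a reference SV */

/* Create a reference to a container, counting it as one more owner. */
static toy_ref toy_newref(toy_hv *hv) {
    assert(hv != NULL);
    toy_ref r = { hv };
    hv->refcnt++;
    return r;
}

/* Copy a reference value: the copy is a separate struct, but it points
 * at the same container, which gains one more owner. */
static toy_ref toy_copyref(const toy_ref *src) {
    return toy_newref(src->target);
}
```

After copying, the two toy_ref values live at different addresses but hold the same target pointer, and the container's count has gone up once for each of them.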

This copying is also done when returning values from a subroutine.

It's important to keep in mind that most operations in Perl involve copying the SV structure, because assigning an SV pointer actually creates an alias. Accidentally creating aliases is a typical issue I have as an XS n00b.

In XS we usually use newSVsv to create a copy, because often the target SV does not yet exist. This is the same as creating a new empty SV and calling SvSetMagicSV on it.

Have you ever seen the error Bizarre copy of %s? That's what happens when sv_setsv is called on a value that isn't a simple SV (a number, string or reference), for example an array or a hash. Correctly copying arrays and hashes involves making copies of all the SVs inside them and reinserting those copies into a new AV or HV.
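A toy version of "copy every element into a new container" might look like the sketch below. toy_av, toy_av_copy and toy_av_free are made-up names, and the elements are bare integers instead of full SVs, so this only illustrates the shape of the operation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: hypothetical names, not Perl's real AV. */
typedef struct { long iv; } toy_sv;
typedef struct { toy_sv **elems; size_t len; } toy_av;

/* Copy an array element by element, so that the new array owns fresh
 * values instead of aliasing the originals.
 * Returns 0 on success, -1 on allocation failure (dst is left empty). */
static int toy_av_copy(toy_av *dst, const toy_av *src) {
    assert(dst != NULL && src != NULL);
    dst->elems = NULL;
    dst->len = 0;
    if (src->len == 0)
        return 0;
    toy_sv **elems = malloc(src->len * sizeof *elems);
    if (elems == NULL)
        return -1;
    for (size_t i = 0; i < src->len; i++) {
        elems[i] = malloc(sizeof *elems[i]);
        if (elems[i] == NULL) {
            while (i > 0)               /* undo the copies made so far */
                free(elems[--i]);
            free(elems);
            return -1;
        }
        *elems[i] = *src->elems[i];     /* copy the value, not the pointer */
    }
    dst->elems = elems;
    dst->len = src->len;
    return 0;
}

/* Release everything toy_av_copy allocated. */
static void toy_av_free(toy_av *av) {
    for (size_t i = 0; i < av->len; i++)
        free(av->elems[i]);
    free(av->elems);
    av->elems = NULL;
    av->len = 0;
}
```

Because each element is duplicated, changing an element of the copy leaves the original array untouched. Copying only the pointers would have produced aliases instead.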

References

So as I just mentioned, there are 3 valid things to copy around using setsv: undef, simple values (strings and numbers), and references. Complex data structures are passed by reference, or by evaluating in list context in a way that simply puts every element SV on the stack.

A reference is an SV whose ANY field points to an SV * which points at another SV.

Calling the SvRV macro on an SV with ROK set returns the target SV.

In the case of e.g. \@array the SV will return true for SvROK(sv), and SvRV returns the AV of @array.

Casting

When working with references casting becomes important. Structs like AV, HV etc. have the same fields as SV but the operations you can perform on them are different. The various API functions operating on these types require pointers of the correct type, which means you often need to cast pointers from one type to another.

For instance, if we've received a reference to an array in ref_sv, we could do something like this:

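SV *ref_sv; /* assume this was handed to us and ROK is set */
I32 i = 0;  /* index of the element we want */

SV *array_sv = SvRV(ref_sv);             /* the SV the reference points at */
assert( SvTYPE(array_sv) == SVt_PVAV );  /* make sure it really is an array */

AV *array = (AV *)array_sv;
SV **elem = av_fetch(array, i, FALSE);   /* NULL if there is no such element */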

Fin

This post is getting very long so I will STFU now. In the next post I'll review the basics of XS boilerplate, and that plus this post should be enough to get started on reading/writing simple XS code.
