Monday, October 5, 2009

Are Filehandles Objects?

Perl has a very confusing set of behaviors for treating filehandles as objects.

ADHD Summary

Globs which contain open handles can be treated as objects, even though they aren't blessed.

Always load IO::Handle and FileHandle, to allow the method syntax.

Whenever you are using filehandles, use the method syntax, regardless of whether it's a real handle or a fake one. A fake handle that works with the builtins needs to jump through some nasty hoops.

Whenever you are creating fake handles, the aforementioned hoops are these: tie *{$self} (or return a tied glob from the *{} overload), so that the builtins will know to call your object's methods through your TIEHANDLE implementation.

The Long Story

There are several potentially overlapping data types that can be used to perform IO.

  • Unblessed globs containing IO objects
  • Blessed globs containing IO objects
  • Blessed IO objects
  • Objects resembling IO objects (i.e. $obj->can("print")), which aren't necessarily globs
  • Tied globs

and there are two ways to use these data types, but some types only support one method:

  • method calls ($fh->print("foo"))
  • builtins (print $fh, "foo")

Lastly, there are a number of built in classes (which are not loaded by default, but are in core), such as IO::Handle and FileHandle.

When you open a standard filehandle:

use autodie;

open my $fh, "<", $filename;

the variable $fh contains an unblessed reference to a type glob. This type glob contains an IO reference in the IO slot, which is blessed into the class IO::Handle by default.

The IO object is blessed even if IO::Handle is not loaded.

If it is loaded, then both ways of using the handle are allowed:

# builtin
read $fh, my $var, 4096;

# as method
$fh->read(my $var, 4096);
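As a minimal sketch of the two forms side by side, here is something that should run as is. The in-memory file and the variable names are my own illustration, not anything from core. It also peeks at what $fh really is. The class name found in the IO slot varies between perl versions, so the sketch only records it instead of assuming a value:

```perl
use strict;
use warnings;
use IO::Handle;
use FileHandle;
use Scalar::Util qw(blessed reftype);

# An in-memory file keeps the sketch self contained.
my $data = "hello world";
open my $fh, "<", \$data or die "open: $!";

# $fh itself is a plain, unblessed GLOB reference ...
my $fh_class = blessed($fh);    # undef, since the glob is not blessed
my $fh_type  = reftype($fh);    # "GLOB"

# ... but the IO slot inside the glob holds a blessed object.
my $io_class = blessed( *{$fh}{IO} );    # class name depends on the perl version

# builtin form
read $fh, my $first, 5;

# method form, dispatched using the class of the IO slot
$fh->read( my $second, 6 );
```

Both calls read from the same underlying handle, so the second read continues where the first one stopped.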

These two forms behave similarly. One could even be led to believe that the first form is actually treated as indirect method syntax. This is unfortunately very far from the truth.

In the first form the read is executed directly. In the second form the read method would be invoked on the unblessed glob. However, instead of throwing the usual Can't call method "read" on unblessed reference error, Perl's method_common routine (which implements method dispatch) special cases globs with an IO slot to actually dispatch the method using the class of *{$fh}{IO}. In a way this is very similar to how autobox works.

If you remembered to use FileHandle then this should result in a successful dispatch to IO::Handle::read, which actually delegates to the read builtin:

sub read {
    @_ == 3 || @_ == 4 or croak 'usage: $io->read(BUF, LEN [, OFFSET])';
    read($_[0], $_[1], $_[2], $_[3] || 0);
}

When you create nonstandard IO objects, this breaks down:

{
    package MyHandle;

    sub read {
        my ( $self, undef, $length, $offset ) = @_;

        substr($_[1], $offset, $length) = ...;
    }
}

because now $myhandle->read(...) will work as expected, but read($myhandle, ...) will not. If it is a blessed glob the error will be read() on unopened filehandle, and for other data types the error will be Not a GLOB reference.
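To make the failure concrete, here is a hedged sketch with the elided body filled in by a toy implementation. The hash based storage, the new constructor and the pos bookkeeping are all assumptions of mine, and the offset handling is deliberately simplistic:

```perl
use strict;
use warnings;

{
    package MyHandle;    # hypothetical class, a blessed hash rather than a glob

    sub new {
        my ( $class, $data ) = @_;
        return bless { data => $data, pos => 0 }, $class;
    }

    sub read {
        my ( $self, undef, $length, $offset ) = @_;
        $offset ||= 0;

        my $chunk = substr( $self->{data}, $self->{pos}, $length );
        $self->{pos} += length $chunk;

        # $_[1] is an alias to the caller's buffer
        $_[1] = '' unless defined $_[1];
        substr( $_[1], $offset ) = $chunk;

        return length $chunk;
    }
}

my $myhandle = MyHandle->new("hello world");

# method syntax works as expected
my $got = $myhandle->read( my $buf, 5 );

# the builtin knows nothing about MyHandle and dies at runtime
my $ok    = eval { read( $myhandle, my $other, 5 ); 1 };
my $error = $@;
```

After this runs, $buf holds the first five characters, $ok is false, and $error holds the message perl raised for the builtin call.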

Tied Handles

Tied handles are very similar to built in handles: they contain an IO slot with a blessed object (this time the default is the FileHandle class, a subclass of IO::Handle that has a few additional methods, like support for seeking).

The IO object is marked as a tied data structure so that the builtin opcodes will delegate to some object implementing the TIEHANDLE interface.

A method call on a tied object will therefore invoke the method on the FileHandle class, which will delegate to the builtin, which will delegate to a method on the object.

Because of this, classes like IO::String employ a clever trick to allow the builtins to be used:

my $fh = IO::String->new($string);

# readline builtin calls getline method
while ( defined( my $line = <$fh> ) ) {
    ...
}

# print builtin calls print method
print $fh "foo";

IO::String::new ties the new handle to itself[1], and sets up the extra glue:

sub new
{
    my $class = shift;
    my $self = bless Symbol::gensym(), ref($class) || $class;
    tie *$self, $self;
    $self->open(@_);
    return $self;
}

The TIEHANDLE API is set up by aliasing symbols:

*READ   = \&read;

Effectively this reverses the way it normally works. Instead of IO::Handle methods delegating to the builtin ops, the builtin ops delegate to the IO::String methods.
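The same trick can be sketched in a few lines without IO::String. MyLineHandle is a hypothetical class of my own, loosely modelled on the constructor quoted above. It keeps its lines in the glob's array slot and only handles scalar context reads, so treat it as an illustration, not a replacement:

```perl
use strict;
use warnings;
use Symbol ();

{
    package MyLineHandle;    # hypothetical self-tied handle class

    sub new {
        my ( $class, @lines ) = @_;
        my $self = bless Symbol::gensym(), $class;
        @{*$self} = @lines;    # keep state in the glob's own array slot
        tie *$self, $self;     # self-tie, so the builtins find our methods
        return $self;
    }

    # tie() calls this as a method on the object we passed in
    sub TIEHANDLE { return $_[0] }

    sub getline { my $self = shift; return shift @{*$self} }
    sub print   { my $self = shift; push @{*$self}, join( '', @_ ); return 1 }

    # the TIEHANDLE API is just aliases to the ordinary methods
    no warnings 'once';
    *READLINE = \&getline;
    *PRINT    = \&print;
}

my $fh = MyLineHandle->new( "one\n", "two\n" );

my $first = <$fh>;            # readline builtin, which calls READLINE (getline)
print $fh "three\n";          # print builtin, which calls PRINT (print)
my $second = $fh->getline;    # plain method call
my $third  = $fh->getline;    # returns the line the builtin print appended
```

Both syntaxes end up in the same two methods, which is the whole point of the self-tie.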

IO::Handle::Util provides an io_to_glob helper function which produces a tied unblessed glob that delegates to the methods of an IO object. This function is then used to implement *{} overloading. This allows non-glob handles to automatically create a working glob as necessary, without needing to implement the tie kludge manually.
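For a feel of what such glue involves, here is a hand-rolled approximation of the idea. It does not use io_to_glob and is not how IO::Handle::Util is actually implemented. MyTie, MyBuffer and as_glob are hypothetical names, and only PRINT is forwarded:

```perl
use strict;
use warnings;
use Symbol ();
use Scalar::Util ();

{
    package MyTie;    # hypothetical glue class: forwards tie calls to an object

    sub TIEHANDLE {
        my ( $class, $obj ) = @_;
        my $self = bless { obj => $obj }, $class;
        Scalar::Util::weaken( $self->{obj} );    # avoid a reference cycle
        return $self;
    }

    sub PRINT { my $self = shift; return $self->{obj}->print(@_) }
}

{
    package MyBuffer;    # hypothetical handle-like object that is not a glob

    use overload
        '*{}'    => sub { $_[0]->as_glob },
        fallback => 1;

    sub new { return bless { chunks => [] }, shift }

    sub print {
        my $self = shift;
        push @{ $self->{chunks} }, join( '', @_ );
        return 1;
    }

    # build the tied glob lazily and cache it on the object
    sub as_glob {
        my $self = shift;
        return $self->{glob} ||= do {
            my $glob = Symbol::gensym();
            tie *$glob, 'MyTie', $self;
            $glob;
        };
    }
}

my $buf = MyBuffer->new;

$buf->print("method\n");    # direct method call
print $buf "builtin\n";     # builtin goes through *{} and then the tie

my $all = join '', @{ $buf->{chunks} };
```

The object stays a plain blessed hash, and the glob only comes into existence when a builtin asks for one.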

Conclusion

When working with standard or nonstandard handle types, method syntax always works (provided IO::Handle and FileHandle are loaded), but the builtin syntax only works for real handles and tied ones, so when using a filehandle I prefer the method syntax.

It also makes the silly print {$fh} idiom unnecessary, since direct method syntax isn't ambiguous.
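A small sketch of the difference, assuming the handle lives in a hash (the %opt hash and the in-memory output buffer are made up for the example):

```perl
use strict;
use warnings;
use IO::Handle;
use FileHandle;

my $out = '';
open my $fh, ">", \$out or die "open: $!";

my %opt = ( fh => $fh );    # hypothetical options hash holding the handle

# writing  print $opt{fh} "..."  is a syntax error, hence the block
print { $opt{fh} } "block form\n";

# method syntax needs no special casing
$opt{fh}->print("method form\n");

close $fh or die "close: $!";
```

Once the handle is closed, $out contains both lines in order.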

Performance is a non-issue: the extra overhead is nothing compared to PerlIO indirection, ties, and making the actual system calls.

However, when creating nonstandard IO objects, you should probably provide a tie fallback so that code that doesn't use method syntax will not die with strange errors (or worse, violate the encapsulation of your handle object and just work on the internal glob structure directly).

This is one of my least favourite parts of Perl. It's such a horrible cascade of kludges. In the spirit of Postel's Law, consistently using methods is the conservative thing to do (it works in all cases), and providing a tie based fallback is a way to be liberal about what you accept.

[1] Surprisingly this doesn't leak, even though weaken(tie(*$self, $self)) is not used. I suspect there is a special case that prevents the refcount increment if the object is tied to itself. See also Tie::Util

4 comments:

kappa said...

Interesting. I have always thought that print $fh $string is interpreted as an indirect object notation of method call rather than built-in function call.

nothingmuch said...

Yeah it looks and feels like one, but it isn't... See this optree dump

ostbey said...

You are right about your suspicion that there is a special case that prevents the refcount increment if the object is tied to itself.

At least for globs this is true.

This has been changed several times in the Perl core, but was eventually kept despite its kludgeness (as you call it rightly, "a horrible cascade of kludges") because there were modules such as e.g. Data::Locations which depend on it (and which stopped working several times after such changes to the core).

At that time the Perl 5 Porters even wondered who would ever want self-tied globs and what for, but you have precisely shown why: to allow the builtin syntax AND the method syntax to both work.

miyagawa said...

Just a heads up: perl 5.11.3 (and .4 obviously) broke this.

http://gist.github.com/290443 This test works on 5.8, 5.10 and 5.11.2 but fails on 5.11.3. I just brought it up on #p5p to see whether this is an intended bug incompatibilities...