Wednesday, May 13, 2009

Devel::StringInfo

The next installment in the "modules I haven't talked about" series is Devel::StringInfo.

Devel::StringInfo collects information about a string to determine what encoding it is in, what other encodings it could be in, and what Unicode string it would be if reinterpreted as such.

Encoding confusion usually happens because Perl stupidly assumes the default encoding for all undecoded strings is Latin-1, so when combining a string of bytes which are valid UTF-8 data with a Unicode character string, the byte string is decoded as Latin-1 instead of UTF-8 as most people expect. Since virtually any byte sequence is valid Latin-1 this is a silent conversion whose side effects are usually observed very far away. To make things worse, when printing out Unicode strings without an explicit conversion, they are encoded as UTF-8, which means the data will not survive a round trip.
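To make the silent upgrade concrete, here is a small sketch using only core Perl. The variable names are made up for illustration; the byte string holds the two UTF-8 bytes for "é".

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" as raw UTF-8 bytes: a byte string that was never decoded
my $bytes = "\xC3\xA9";

# A character string containing a code point above 255
my $chars = "\x{263A}";    # WHITE SMILING FACE

# Concatenation upgrades $bytes as if it were Latin-1, so the two
# bytes become the two characters U+00C3 and U+00A9.
my $mixed = $bytes . $chars;

printf "mixed length: %d\n", length $mixed;                  # 3, not 2
printf "mixed first:  U+%04X\n", ord substr($mixed, 0, 1);   # U+00C3

# Decoding explicitly before combining gives the intended result.
my $fixed = decode('UTF-8', $bytes) . $chars;

printf "fixed length: %d\n", length $fixed;                  # 2
printf "fixed first:  U+%04X\n", ord substr($fixed, 0, 1);   # U+00E9
```

No warning is issued for the first concatenation, which is why the damage tends to show up far from where it happened.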

miyagawa's Encode::DoubleEncodedUTF8 module can be used to work around this problem, but you are better off identifying the cause and fixing it.

By using Devel::StringInfo to gather information about your strings you can identify byte strings that should be decoded (is_utf8 is false, but the string appears to have UTF-8 data in it). When concatenating suspect strings together, test both the inputs and the resulting string.
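The check described above can be roughly approximated with core functions. This is only a sketch of the idea, not Devel::StringInfo's own implementation; looks_like_undecoded_utf8 is a hypothetical helper name, and the result is a heuristic rather than proof.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $str has the UTF8 flag off, contains
# non-ASCII bytes, and those bytes form strictly valid UTF-8.
sub looks_like_undecoded_utf8 {
    my ($str) = @_;
    return 0 if utf8::is_utf8($str);           # already a character string
    return 0 unless $str =~ /[^\x00-\x7F]/;    # pure ASCII, nothing to decide
    # decode() croaks on malformed input when FB_CROAK is set
    return eval { decode('UTF-8', $str, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my $utf8_bytes   = "caf\xC3\xA9";                  # UTF-8 bytes, not decoded
my $latin1_bytes = "caf\xE9";                      # Latin-1 bytes, invalid as UTF-8
my $decoded      = decode('UTF-8', $utf8_bytes);   # proper character string

printf "%d\n", looks_like_undecoded_utf8($utf8_bytes);     # 1
printf "%d\n", looks_like_undecoded_utf8($latin1_bytes);   # 0
printf "%d\n", looks_like_undecoded_utf8($decoded);        # 0
```

A string that passes this check is a good candidate for a missing decode() somewhere upstream.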

Perl's Unicode handling is very confusing because of the relationship between ASCII, UTF-8 and ISO 8859-1 (Latin-1). These encodings all overlap for the lowest 128 code points, so unless you are using strings in a language other than English your code might be wrong but appear to be working correctly.

In my opinion the best solution is to always decode as early as possible, and encode as late as possible. binmode($fh, ":encoding(utf8)") is handy for this (Update: this originally read binmode($fh, ":utf8"); see the discussion in the comments and also read this page on utf8 and PerlIO). Also try to keep your data encoded in either UTF-8 or ASCII. Any handling of other encodings should be clearly and obviously marked in the source code, and the data decoded into Unicode strings as early as possible. Usually you are not processing binary data. It's up to you to tell Perl which data is actually text.
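A minimal sketch of "decode early, encode late" using I/O layers. In-memory filehandles stand in for real files here so the example is self-contained, and it uses the strict :encoding(UTF-8) spelling recommended in the comments.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes standing in for a file on disk ("café\n")
my $input  = "caf\xC3\xA9\n";
my $output = '';

# Decode at the boundary: the layer turns bytes into characters on read.
open my $in, '<:encoding(UTF-8)', \$input or die "open input: $!";

# Encode at the boundary: characters become UTF-8 bytes on write.
open my $out, '>:encoding(UTF-8)', \$output or die "open output: $!";

while (my $line = <$in>) {
    chomp $line;
    # Inside the program we only ever see characters, so length()
    # counts "é" once rather than counting its two bytes.
    print {$out} $line, " has ", length($line), " characters\n";
}

close $in;
close $out or die "close output: $!";

# $output now holds the UTF-8 bytes for "café has 4 characters\n"
```

For a handle that is already open, binmode($fh, ":encoding(UTF-8)") applies the same layer.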

A few more notes:

  • Don't forget to use utf8 when you have UTF-8 encoded string literals in your source code. The default encoding for Perl source code is unfortunately Latin-1.
  • Read and understand perlunitut. There is no magical way to cargo cult something and end up with working code. You must know how Perl treats your data.
  • Remember that perl implicitly decodes when you combine string and binary data and implicitly encodes when you print to a filehandle.
  • encoding::warnings will warn you if you implicitly decode data from bytes to Unicode characters, but you need to remember to use it anywhere you handle strings.
  • For more advice see Juerd's perluniadvice page.
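
The first note above can be seen in a short sketch. It assumes the source file itself is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Without "use utf8" the literal is read as raw bytes: the two UTF-8
# bytes of "é" become two separate characters.
my $without = do { no utf8; "é" };

# With "use utf8" the parser decodes the literal into one character.
my $with = do { use utf8; "é" };

printf "without use utf8: %d\n", length $without;   # 2
printf "with use utf8:    %d\n", length $with;      # 1
```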

4 comments:

Aristotle said...

One bit of advice Juerd gives is… not to use the :utf8 I/O layer. Use :encoding(UTF-8) instead. The difference is that :utf8 is equivalent to doing _utf8_on() whereas :encoding(UTF-8) equates to decode_utf8().

nothingmuch said...

You learn something every day =)

For those who don't know the difference, _utf8_on doesn't actually validate the data, which means that if the data in the filehandle is not actually valid UTF-8 you're in trouble ;-)

Aristotle said...

Yup; and for those who are unaware: this has security implications, so it’s not merely some obscure encoding theory issue.

Code safely.

J. Shirley said...

pssst, your update has :encoding(utf8) and not :encoding(UTF-8)