Everything you need to know about Unicode
But were too terrified to ask
Two hundred and fifty five characters really ought to be enough for anyone. I've lost count of how many times I've heard this statement or similar sentiments expressed when it comes to dealing with Unicode and the more general question of character encodings.
However, this kind of ASCII-centric thinking is becoming a liability. As Harald Tveit Alvestrand put it in RFC1766, "There are a number of languages spoken by human beings in this world", and the Unicode standard was designed to be a way to make it easy for data from all kinds of environments, languages and scripts to play nice together.
Until Unicode came along, the world was in a mess - or at least, it was in terms of data processing. Anyone who wanted to represent any kind of non-Latin character had to cobble together their own set of important characters to live in the top 128 codepoints freed up when we all moved from seven-bit ASCII to eight bits. Unfortunately, when everyone has their own idea of what character 160 means, depending on whether they're coming from ASCII extensions to support Hebrew, Cyrillic or the plain old European accents defined in ISO 8859-1, data interchange is impossible.
To make things worse, the Chinese, Japanese and Koreans got involved with data processing and soon realised that the 128 spare codepoints just weren't going to put a dent in their data processing needs. With over 2,000 kanji characters in general use in Japan, plus two alphabets of about 85 characters, 255 characters starts to look a bit piffling.
If you need more than 255 characters, you're not going to be able to store each character in a single eight-bit byte. Going to sixteen-bit bytes was not an option, so they devised any number of encoding mechanisms to shoehorn huge numbers of codepoints into eight-bit bytes - EUC, JIS, Shift-JIS, Big-5, and many others. Many of these try to maintain compatibility with ASCII by keeping the semantics of the bottom 128 characters and using the top half as "shift" characters which introduce a wider character. Now we not only have many incompatible assignments of codepoints to characters, we have multiple incompatible ways of representing "wide" characters (more than a single byte) on disk or in memory.
Unicode came along to sort all this out. It introduced a single mapping between codepoint and character for every written script on Earth - the Unicode character set. It also proposed a number of standard ways to lay out these characters when they get bigger than a single byte - the UTF-8, UTF-16 and UTF-32 standards. (As well as some extra ones like UTF-7 that nobody seriously uses.)
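As a rough illustration of how the same codepoint is laid out differently by each of these encodings, here is a minimal sketch using the Encode module (covered later in this article). The helper name encoded_length is made up for this example, and the big-endian variants are chosen only so that no byte-order mark is added to the counts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes does a string occupy in an encoding?
sub encoded_length {
    my ($encoding, $string) = @_;
    return length encode($encoding, $string);
}

# Compare a plain ASCII letter with WHITE SMILING FACE (U+263A).
for my $char ("A", "\x{263a}") {
    printf "U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes\n",
        ord($char),
        map { encoded_length($_, $char) } qw(UTF-8 UTF-16BE UTF-32BE);
}
```

The ASCII letter stays a single byte in UTF-8, while the smiley needs three; UTF-16 and UTF-32 use two and four bytes respectively for both.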
Perl caught the Unicode bandwagon pretty early, thanks in part to Larry Wall's foresight (not to mention his love of Japan and its language), but many of Perl's programmers aren't on board yet. This month I'm going to try to turn you from an ASCII-phile to a Unicode-aware programmer.
Generating and munging Unicode data
First, though, how do we create and deal with Unicode characters? Perl tries to make this as natural as possible. For instance, where chr and ord could previously deal with values up to 255, they can now deal with values up to 4294967295, at least on my poor old 32-bit computer. Similarly, string escapes have been extended to deal with characters higher than \xFF. However, to keep Perl compatible with old programs which may say "\x0dabc", if you want an escape sequence longer than two characters, you must surround the character code with curly braces, like so:
print "\x{263a}\n"; # Prints a white smiling face
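To make the chr, ord and escape rules above concrete, here is a small sketch. The variable names are only for illustration, and the output is kept to ASCII so that it does not depend on how the terminal is set up.

```perl
use strict;
use warnings;

# chr() and ord() work on codepoints, not bytes.
my $smiley = chr(0x263a);
my $code   = ord("\x{263a}");
printf "chr and the escape agree: %s (U+%04X)\n",
    ($smiley eq "\x{263a}" ? "yes" : "no"), $code;

# Without braces, \x consumes at most two hex digits, so this is a
# carriage return followed by the three letters "abc".
my $legacy = "\x0dabc";
printf "%d characters, the first is U+%04X\n", length($legacy), ord($legacy);
```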
Now it may not be immediately apparent that codepoint 263a is a white smiling face, so Perl provides the charnames pragma which allows you to specify characters by name, using the \N escape:
use charnames ':full';
print "\N{WHITE SMILING FACE}\n";
Since the names themselves may not be that easy to find unless you have a copy of the Unicode Standard to hand, and may be a little unwieldy even if you do, you can also specify a short name consisting of the script name, and the character name. For instance:
use charnames ':short';
print "\N{katakana:sa}\N{katakana:i}\N{katakana:mo}\N{katakana:n}";
This will print out my name in Japanese. Of course, if I'm handling lots of Japanese, it gets rather tedious to type the katakana: every time, so we can also say:
use charnames qw(katakana);
print "\N{sa}\N{i}\N{mo}\N{n}\n";
Now we have a bunch of Unicode data to deal with. What can we do with it? Well, the first thing to note is that we can do anything we usually do with Perl. Nothing has changed now that Unicode data appears on the scene.
True, we're dealing with characters which are now wider than a single byte, but that's OK. Perl does the right thing with them:
print length("\N{sa}\N{i}\N{mo}\N{n}"); # prints 4
One neat extra thing that we can do with Unicode data is to use extended regular expressions. For instance, the Unicode Standard defines a set of properties that each character may have, and we can use regular expressions to match these properties. I deal with a kanji dictionary which contains kanji headwords, followed by a mixture of codes and indexes which mean very little to me, and phonetic readings in the katakana and hiragana scripts. We'll see later how I read in the dictionary, but I can extract the hiragana readings like this:
while (<KANJIDIC>) {
my @readings= /(\p{Hiragana}+)/g;
/(\p{Han}+)/ and print "$1: @readings";
}
"Han" is the property descriptor for a Chinese hanzi or Japanese kanji character. For a full list of Unicode properties, see the Unicode standard.
Perl's Unicode support
Perl's own support for Unicode has developed and matured over the years, after a pretty shaky start. Not only that, but the nature of the support and what Perl has offered in terms of Unicode support has changed; writing with the benefit of hindsight, I can now tell you about what Perl can do at the moment - regardless of what it was supposed to be all along. We can mainly ignore all of the motivations and all of the little hacks along the way, and talk about the real world.
But first, a bit of history, so we are clear on what's possible with particular Perl versions. The first Perl release to support any kind of Unicode data was Perl 5.6.0. You could generate Unicode characters as we've discussed above, and you could print that data out, more or less, but there was no other way of getting Unicode data from files or from other sources into your application as Unicode. This was a bit useless, really. It also didn't help that Perl didn't have a clear strategy for what happened when Unicode data hit non-Unicode data, and it's here that an important distinction arises, which we'll look at in a second.
These problems were mostly sorted out through 5.6.1 and done by Perl 5.6.2, but still the problem of getting Unicode data into Perl remained; work began in 5.6.1-or-so to fix this using the Encode module, and this has only been usable since around Perl 5.8.2. So while it is possible to do some Unicode-related work in 5.6.2 if you're careful, real Unicode applications really ought to be based on 5.8.2 and above.
The big lie
I've been claiming that Perl now supports Unicode, but to be honest, that's a little bit of a lie. Perl supports data encoded in the UTF8 representation, and knows what to do with it if that data is Unicode. It doesn't ever know whether that data really does represent Unicode or not.
Let's suppose we're dealing with a string of Japanese data, (as I reasonably often do) and let's further suppose we know nothing about Unicode at all. We're just an ordinary Perl 5 application merrily handling Japanese text, which is encoded in the EUC encoding often used for Unix-based Japanese data processing:
my $hello = "\272\243\306\374\244\317\241\242\300\244\263\246";
print $hello, "\n"; # Prints "Hello world" on an EUC terminal
print length($hello); # 12 bytes
Now we want to play in the Unicode world, and add our familiar smiley face to the end of our "hello world" greeting:
my $smiley = "\N{WHITE SMILING FACE}";
print $hello . $smiley;
At this point we have a problem. Perl has absolutely no idea that this data is Japanese EUC. It could be in any legacy encoding under the sun. And now we want to append a Unicode string to the end. What's Perl going to do?
Well, there's very little it can do. It knows that the string on the right is Unicode data, but it can't assume very much about the string on the left. What it does do is rely on a flag which marks a string as being represented internally as UTF8. It further assumes that when it sees a string that isn't represented as UTF8, this should be treated as ISO-8859-1. Since our Japanese data isn't ISO-8859-1, madness will soon ensue.
Perl will "upgrade" the string to UTF8, but it doesn't know how to convert it to Unicode - what we end up with is some UTF8-encoded Japanese EUC data, not UTF8-encoded Unicode data, and this is no good to man nor beast.
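The damage is easy to measure. The sketch below repeats the concatenation and then inspects the result; the variable names are just for the example, and encode from the Encode module is used only to count the bytes the combined string would occupy as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $hello  = "\272\243\306\374\244\317\241\242\300\244\263\246"; # 12 EUC-JP bytes
my $smiley = "\x{263a}";
my $mixed  = $hello . $smiley;

# Every EUC byte has been treated as one ISO-8859-1 character.
printf "%d characters, the first is U+%04X\n", length($mixed), ord($mixed);

# Written out as UTF-8, each of those 12 high-bit "characters" takes two
# bytes, and the smiley takes three.
my $bytes = encode("UTF-8", $mixed);
printf "%d bytes as UTF-8\n", length($bytes);
```

The first "character" comes out as U+00BA, a Latin-1 ordinal indicator rather than a kanji, which is exactly the EUC-UTF8 mix described above.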
Encode - dealing with legacy data
So what can we do about legacy data, which isn't ISO 8859-1? At this point, that Encode module we mentioned earlier becomes useful. We can't tell Perl what encoding we're dealing with, but we can ask Perl to translate everything to Unicode for us, and use that as a lingua franca - one of the things it was precisely designed to do.
Let's take that same EUC string - the Japanese for "Hello, world":
my $hello = "\272\243\306\374\244\317\241\242\300\244\263\246";
and use Encode to translate it from Japanese-EUC into Unicode:
my $hello_uni = decode("euc-jp", $hello);
Where before we were dealing with the string as a binary sequence of bytes, we're now dealing with it as Unicode characters. This is not just our useless EUC-UTF8 mix, but real, honest-to-goodness Unicode.
print length($hello);
print length($hello_uni);
At this point, all of our Unicode slicing-and-dicing, including Unicode-aware regular expressions will work properly on $hello_uni.
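Put together as one runnable piece, the decoding step looks something like this. It is a sketch that assumes the Encode module is installed with its Japanese encodings, as it is in a standard Perl distribution.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $hello     = "\272\243\306\374\244\317\241\242\300\244\263\246";
my $hello_uni = decode("euc-jp", $hello);

# The original is still a string of bytes; the decoded copy is a string
# of characters, each of which was two bytes in EUC-JP.
printf "%d bytes before, %d characters after\n",
    length($hello), length($hello_uni);
```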
Once we've finished munging our data, of course, we might want to put it back into the EUC format we began with. Once again, Encode helps out, with the predictably-named encode routine.
open OUT, ">sliced-hello.euc" or die $!;
print OUT encode("euc-jp", $hello_uni);
To find out what encodings Encode supports, you can say:
use Encode;
print "$_\n" for Encode->encodings(":all");
So, for instance, we might want to create ourselves a Unicode transcoder - that is, something which takes data in one format, and spits it out in another encoding. This is something I end up doing rather often, so I came up with the following program:
#!/usr/bin/perl
use Encode;
my ($from, $to) = splice(@ARGV,0,2); ($from && $to) or usage();
while (<>) {
my $unicode;
eval { $unicode = decode($from, $_) };
if ($@ =~ /unknown encoding/i) { usage() }
eval { print encode($to, $unicode) };
if ($@ =~ /unknown encoding/i) { usage() }
}
sub usage {
die "$0 - $0 <from> <to> [<file> ...]\n",
"Act as a filter, encoding data from the first character set to the\n",
"second. Available character sets are:\n",
map { sprintf("\t%s\n", $_) } Encode->encodings(":all");
}
However, there's yet a neater way to do things. If you have Encode available, you also have the PerlIO module, which hooks into Perl's IO streams to control how file access is done. PerlIO is a mechanism that can be used to add filters onto a filehandle: one which automatically strips the newlines, for instance, or reads files which are gzipped, or even bypasses standard IO altogether and reads files directly into memory with mmap. Encode hooks into this PerlIO framework, to read and write files through character set encoding or decoding. For instance, to read a Russian file from a Windows computer, using the KOI8-R encoding, you can say:
open IN, "<:encoding(koi8-r)", "russian.txt" or die $!;
And to write it out again for use on a Russian Mac running System 9, you would say:
open OUT, ">:encoding(MacCyrillic)", "russian.mac" or die $!;
while (<IN>) { print OUT $_ }
So if you're content with a little simplicity, you can slim your transcoder down to:
#!/usr/bin/perl -p
BEGIN {
binmode(STDIN, ":encoding(".shift(@ARGV).")");
binmode(STDOUT, ":encoding(".shift(@ARGV).")");
}
Dealing with the outside world
The final piece of the Unicode puzzle comes when you need to send or receive UTF8 data from files, or send to other applications which may or may not know anything about Unicode themselves - such as databases which just store the data, not caring about its semantics.
To store data as Unicode is easy enough - you just do it. If you write to a filehandle with data that Perl thinks contains Unicode characters - that is, has the UTF8 flag set - Perl will write the UTF8 representation of the string to the file:
open OUT, ">smiley.txt" or die $!;
use charnames ':full';
print OUT "\N{WHITE SMILING FACE}\n";
This will work just fine, but Perl will issue a warning when any multi-byte characters are emitted:
Wide character in print at smile.pl line 3.
In order to tell Perl that it's OK to send the output as UTF8, you can set a flag on the filehandle:
binmode(OUT, ":utf8");
Similarly, if you have a file that contains UTF8 data, that you want to recognise as such, you can set the same flag, using binmode again, on the input filehandle:
open IN, "smiley.txt" or die $!;
$a = <IN>; chomp $a;
print length $a; # 3 bytes - not marked as Unicode
close IN;
open IN, "smiley.txt" or die $!;
binmode IN, ":utf8";
$a = <IN>; chomp $a;
print length $a; # 1 character - marked as Unicode
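The same round trip can be tried without touching the disk. This sketch swaps smiley.txt for an in-memory filehandle (opening a reference to a scalar), which is my substitution for the example and not something the code above relies on; the layers behave the same way on a real file.

```perl
use strict;
use warnings;

# Write the smiley through a :utf8 layer into an in-memory "file".
my $buffer = '';
open(my $out, ">:utf8", \$buffer) or die $!;
print $out "\x{263a}\n";
close $out;

# Read it back with no layer: we see the raw bytes.
open(my $in_raw, "<", \$buffer) or die $!;
my $raw = <$in_raw>;
chomp $raw;
close $in_raw;

# Read it back through :utf8: we see one character.
open(my $in_utf8, "<:utf8", \$buffer) or die $!;
my $chars = <$in_utf8>;
chomp $chars;
close $in_utf8;

printf "%d bytes stored, %d without the layer, %d with it\n",
    length($buffer), length($raw), length($chars);
```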
Indeed, this is the usual and best way of getting Unicode data into a Perl application. Unfortunately, files are not the only places where you might receive UTF8-encoded strings. We might read data from a socket, or receive it via DBI from a database or, as I had to do recently, read it from the middle of another binary file.
This last case is particularly interesting - you can't read the whole binary file as though it were UTF8, since you really want to treat it as a stream of bytes; however, when you get to the part representing a string, you need Perl to treat it as a UTF8 string, and work character-wise.
The way to do this is to use utf8 as just another encoding: you have data that you know is in UTF8, and you want Perl to turn it into Unicode data, so you say:
# String is packed with the length first
my $len;
read(BIN, $len, 4);
my $len_bytes = unpack("N", $len);
# Now read the string
my $str;
read(BIN, $str, $len_bytes);
# Make a UTF8-aware copy
my $utf8 = decode("utf8", $str);
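Here is a self-contained version of that idea for experimenting with. The record layout (a four-byte big-endian length, the UTF-8 bytes, then some unrelated trailing bytes) is invented for the sketch, and an in-memory filehandle stands in for the real binary file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Build an invented binary record: length prefix, UTF-8 payload, other bytes.
my $payload = encode("UTF-8", "\x{263a} hi");
my $record  = pack("N", length $payload) . $payload . "\xff\xfe";

open(my $bin, "<", \$record) or die $!;
binmode($bin);

# The length comes first, packed as a 32-bit big-endian integer.
read($bin, my $len, 4) == 4 or die "short read on length";
my $len_bytes = unpack("N", $len);

# Read exactly that many bytes, then decode only that part.
read($bin, my $str, $len_bytes) == $len_bytes or die "short read on string";
my $text = decode("UTF-8", $str);

printf "%d bytes on disk became %d characters\n", $len_bytes, length($text);
```

The trailing bytes are never passed to decode, so the rest of the file can stay a plain stream of bytes.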
There is another way to do this, which is a little messier, but I recommend it none the less. Encode can optionally export a subroutine called _utf8_on. As its name implies, this is an internal routine, in that it directly messes with Perl's internal representation of the string, turning on the bit that says this data is UTF8. I prefer this, however, because it is efficient, it is self-documenting, and it's easier to understand than trying to work out from what decode("utf8", $str) is decoding into what.
Finally, you may have to deal with situations where you don't want to end up with your Unicode data as Unicode. For instance, you have a bunch of database records about your company's contacts in Eastern Europe, that you need to have inserted into your master contacts database. Unfortunately, even though you are an educated and progressive programmer, and have stored everything correctly in Unicode, Headquarters is full of people for whom ISO-8859-1 is a recent advance over 7-bit ASCII. What will you do for your friend in Cžrny?
Here is a problem where you are guaranteed to lose information. You want to represent a character that simply can't be represented in the character set you have to deal with. Your choice is how much information you want to lose. If you take the obvious approach, and say
print encode("iso-8859-1", "C\x{17e}rny");
Encode will helpfully substitute in a "substitute character" for the letter which cannot be represented, and you'll end up with "C?rny". This is acceptable so far, but you should be thankful that you're not dealing with completely non-Latin alphabets, as mail to the Korean city of ??? is not guaranteed to arrive.
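If silent substitution worries you, Encode also lets you ask for an error instead. The sketch below assumes the goal is to turn the Unicode string into ISO-8859-1 bytes with encode; it shows the default behaviour next to the FB_CROAK check value, which makes encode die on a character it cannot map. The variable names are only for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = "C\x{17e}rny";

# Default behaviour: the unmappable letter becomes a substitute character.
my $lossy = encode("iso-8859-1", $name);
print "$lossy\n";

# With FB_CROAK, encode dies instead. A copy is passed in because encode
# may modify its argument when a check value is given.
my $copy   = $name;
my $strict = eval { encode("iso-8859-1", $copy, Encode::FB_CROAK()) };
my $failed = !defined $strict;
print "strict encoding ", ($failed ? "refused" : "succeeded"), "\n";
```

Catching the failure at least gives you the chance to decide how much information to lose, rather than finding the question marks later.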
If you need to lose less information, you could try the wonderful Text::Unidecode module, which tries to turn Unicode strings into "plain text". For example:
use Text::Unidecode;
print unidecode("\x{d478}\x{c0b0}"); # pusan
It's not perfect, but it's certainly better than a stream of question-marks. When you still need to communicate with ASCII dinosaurs, Text::Unidecode will give them pretty much what they deserve.
Unicode for all!
Thankfully, though, the world is getting more and more Unicode aware. As we move into global business, working with more countries, languages and scripts, the importance of Unicode will continue to grow. Most of the time, it takes so few changes to make an application aware of the possibility of Unicode text, or to deal with that text when it arises, that there's really no excuse for not making your code Unicode compliant - do it now, and it'll save time and effort when the time comes.