
Unicode Handling HOWTO



As usual, comments welcome. How can we get this out to XS people?

=head1 Unicode Support

Perl 5.6.0 introduced Unicode support. It's important for porters and XS
writers to understand this support and make sure that the code they
write does not corrupt Unicode data.

=head2 What B<is> Unicode, anyway?

In the olden, less enlightened times, we all used to use ASCII. Most of
us did, anyway. The big problem with ASCII is that it's American. Well,
no, that's not actually the problem; the problem is that it's not
particularly useful for people who don't use the Roman alphabet. What
used to happen was that particular languages would stick their own
alphabet in the upper range of the sequence, between 128 and 255. Of
course, we then ended up with plenty of variants that weren't quite
ASCII, and the whole point of it being a standard was lost.

Worse still, if you've got a language like Chinese or
Japanese that has hundreds or thousands of characters, then you really
can't fit them into a mere 256, so they had to forget about ASCII
altogether, and build their own systems using pairs of numbers to refer
to one character.

To fix this, some people formed Unicode, Inc. and
produced a new character set containing all the characters you can
possibly think of and more. There are several ways of representing these
characters, and the one Perl uses is called UTF8. UTF8 uses
a variable number of bytes to represent a character, instead of just
one. You can learn more about Unicode at
L<http://www.unicode.org/|http://www.unicode.org/>.

=head2 How can I recognise a UTF8 string?

You can't. This is because UTF8 data is stored in bytes just like
non-UTF8 data. The Unicode character 200 (C<0xC8> for you hex types),
capital E with a grave accent, is represented by the two bytes
C<v195.136>. Unfortunately, the non-Unicode string C<chr(195).chr(136)>
has that byte sequence as well. So you can't tell just by looking - this
is what makes Unicode input an interesting problem.
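
To make the ambiguity concrete, here is a minimal sketch in plain C,
not using the Perl API. C<sketch_utf8_length> is a hypothetical helper
that counts characters on the assumption that the buffer is
well-formed UTF8. The same two bytes, 195 and 136, are two characters
when read as plain bytes and one character (200) when read as UTF8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Counts characters in a buffer that is ASSUMED to be well-formed UTF8.
 * Continuation bytes (10xxxxxx) belong to the preceding character, so
 * only the remaining bytes start a new character. */
size_t sketch_utf8_length(const unsigned char *s, size_t len)
{
    size_t i, n = 0;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Nothing in the buffer itself says which count is the right one; that
information has to travel alongside the bytes.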

The API function C<is_utf8_string> can help; it'll tell you if a string
contains only valid UTF8 characters. However, it can't do the work for
you: a string of bytes that was never meant to be UTF8 can still happen
to be valid UTF8. On a character-by-character basis, C<is_utf8_char>
will tell you whether the current character in a string is valid UTF8.
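
As a rough illustration of what a validity check looks at, and of why
passing it proves so little, here is a plain C sketch. It is not
Perl's implementation: C<sketch_is_utf8> is a hypothetical name, it
only understands characters of up to three bytes, and it does not
reject every overlong three-byte form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns 1 if the buffer is structurally valid UTF8 made of one-,
 * two- and three-byte characters, 0 otherwise. */
int sketch_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        size_t need;                      /* continuation bytes expected */

        if (s[i] < 0x80)
            need = 0;                     /* 0xxxxxxx */
        else if (s[i] >= 0xC2 && s[i] <= 0xDF)
            need = 1;                     /* 110xxxxx, overlong C0/C1 refused */
        else if ((s[i] & 0xF0) == 0xE0)
            need = 2;                     /* 1110xxxx */
        else
            return 0;                     /* stray continuation or longer form */
        i++;
        while (need > 0) {
            if (i >= len || (s[i] & 0xC0) != 0x80)
                return 0;                 /* truncated or malformed */
            i++;
            need--;
        }
    }
    return 1;
}
```

The byte string C<chr(195).chr(136)> passes this check even if its
author meant two separate high-bit characters, whereas a lone byte 200
fails it.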

=head2 How does UTF8 represent Unicode characters?

As mentioned above, UTF8 uses a variable number of bytes to store a
character. Characters with values 0...127 are stored in one byte, just
like good ol' ASCII. Character 128 is stored as C<v194.128>; this
continues up to character 191, which is C<v194.191>. Now we've run out of
bits (191 is binary C<10111111>) so we move on; 192 is C<v195.128>. And
so it goes on, moving to three bytes at character 2048.
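
The bit-packing behind those numbers can be sketched in plain C.
C<sketch_encode> is a hypothetical helper, not the Perl API, and it
only handles character values below C<0x10000> (one to three bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Encodes a character value below 0x10000 as UTF8 into out, which must
 * have room for at least 3 bytes. Returns the number of bytes written. */
size_t sketch_encode(unsigned long uv, unsigned char *out)
{
    assert(out != NULL && uv < 0x10000UL);

    if (uv < 0x80) {                      /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)uv;
        return 1;
    }
    if (uv < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (uv >> 6));
        out[1] = (unsigned char)(0x80 | (uv & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (uv >> 12));
    out[1] = (unsigned char)(0x80 | ((uv >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (uv & 0x3F));
    return 3;
}
```

Feeding it 191, 192 and 2048 gives C<v194.191>, C<v195.128> and
C<v224.160.128> respectively.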

Assuming you know you're dealing with a UTF8 string, you can find out
how long the first character in it is with the C<UTF8SKIP> macro:

    char *utf = "\305\233\340\240\201";
    I32 len;
    
    len = UTF8SKIP(utf); /* len is 2 here */
    utf += len;
    len = UTF8SKIP(utf); /* len is 3 here */

Another way to skip over characters in a UTF8 string is to use
C<utf8_hop>, which takes a string and a number of characters to skip
over. You're on your own about bounds checking, though, so don't use it
lightly.
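
For comparison, here is a plain C sketch of a hop that does check its
bounds. Both function names are hypothetical and neither is part of
the Perl API; lead bytes of C<0xF0> and above are all treated as
starting a four-byte character, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the character that starts
 * with this lead byte. A stray continuation byte counts as length 1 so
 * that a scan always moves forward. */
size_t sketch_skip(unsigned char lead)
{
    if (lead < 0xC0) return 1;            /* ASCII or stray continuation */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;                             /* simplification, see lead-in */
}

/* Hypothetical helper: move forward over up to n characters, never
 * stepping past end (one past the last byte of the buffer). */
const unsigned char *sketch_hop(const unsigned char *s,
                                const unsigned char *end, size_t n)
{
    assert(s != NULL && end != NULL && s <= end);

    while (n > 0 && s < end) {
        size_t step = sketch_skip(*s);
        size_t left = (size_t)(end - s);

        s += (step > left) ? left : step; /* clamp a truncated character */
        n--;
    }
    return s;
}
```

Passing the end of the buffer explicitly is the design choice here: the
function can then stop early instead of trusting the caller's count.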

All bytes in a multi-byte UTF8 character will have the high bit set, so
you can test if you need to do something special with this character
like this:

    UV uv;

    if (*utf & 0x80)
        /* Must treat this as UTF8 */
        uv = utf8_to_uv(utf);
    else
        /* OK to treat this character as a byte */
        uv = *utf;

You can also see in that example that we use C<utf8_to_uv> to get the
value of the character; the inverse function C<uv_to_utf8> is available
for putting a UV into UTF8:

    if (uv >= 0x80)
        /* Must treat this as UTF8 */
        utf8 = uv_to_utf8(utf8, uv);
    else
        /* OK to treat this character as a byte */
        *utf8++ = uv;

You B<must> convert characters to UVs using the above functions if
you're ever in a situation where you have to match UTF8 and non-UTF8
characters. You may not skip over UTF8 characters in this case. If you
do this, you'll lose the ability to match high-bit non-UTF8 characters;
for instance, if your UTF8 string contains C<v195.136>, and you skip
that character, you can never match a C<chr(200)> in a non-UTF8 string.
So don't do that!
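
Here is a plain C sketch of that rule: decode each UTF8 character to
its value and compare values, rather than comparing or skipping raw
bytes. The helper names are hypothetical, nothing here is the Perl
API, and the UTF8 input is assumed to be well-formed with characters
of at most three bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one character from well-formed UTF8
 * (at most three bytes long) and advance *p past it. */
unsigned long sketch_decode(const unsigned char **p)
{
    const unsigned char *s;
    unsigned long uv;

    assert(p != NULL && *p != NULL);
    s = *p;
    if (s[0] < 0x80) {                    /* one byte */
        uv = s[0];
        *p = s + 1;
    } else if (s[0] < 0xE0) {             /* two bytes */
        uv = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        *p = s + 2;
    } else {                              /* three bytes */
        uv = ((unsigned long)(s[0] & 0x0F) << 12)
           | ((unsigned long)(s[1] & 0x3F) << 6)
           | (s[2] & 0x3F);
        *p = s + 3;
    }
    return uv;
}

/* Hypothetical helper: returns 1 if the UTF8 buffer u and the plain
 * byte buffer b hold the same sequence of character values. */
int sketch_same_chars(const unsigned char *u, size_t ulen,
                      const unsigned char *b, size_t blen)
{
    const unsigned char *uend = u + ulen;
    size_t i = 0;

    while (u < uend && i < blen) {
        if (sketch_decode(&u) != b[i++])
            return 0;
    }
    return u == uend && i == blen;        /* both must be used up */
}
```

With this, the UTF8 bytes 195, 136 match the single byte 200, and do
not match the two plain bytes 195, 136.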

=head2 How does Perl store UTF8 strings?

Currently, Perl deals with Unicode strings and non-Unicode strings
slightly differently. If a string has been identified as being UTF-8
encoded, Perl will set a flag in the SV, C<SVf_UTF8>. You can check and
manipulate this flag with the following macros:

    SvUTF8(sv)
    SvUTF8_on(sv)
    SvUTF8_off(sv)

This flag has an important effect on Perl's treatment of the string: if
Unicode data is not properly distinguished, regular expressions,
C<length>, C<substr> and other string handling operations will have
undesirable results.

The problem comes when you have, for instance, a string that isn't
flagged as UTF8, and contains a byte sequence that could be UTF8 -
especially when combining non-UTF8 and UTF8 strings.

Never forget that the C<SVf_UTF8> flag is separate from the PV value; you
need to be sure you don't accidentally knock it off while you're
manipulating SVs. More specifically, you cannot expect to do this:

    SV *sv;
    SV *nsv;
    STRLEN len;
    char *p;

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);

The C<char*> string does not tell you the whole story, and you can't
copy or reconstruct an SV just by copying the string value. Check if the
old SV has the UTF8 flag set, and act accordingly:

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);
    if (SvUTF8(sv))
        SvUTF8_on(nsv);

In fact, your C<frobnicate> function should be made aware of whether or
not it's dealing with UTF8 data, so that it can handle the string
appropriately.

=head2 How do I convert a string to UTF8?

If you're mixing UTF8 and non-UTF8 strings, you might find it necessary
to upgrade one of the strings to UTF8. If you've got an SV, the easiest
way to do this is:

    sv_utf8_upgrade(sv);

However, you must not do this, for example:

    if (!SvUTF8(left))
        sv_utf8_upgrade(left);

If you do this in a binary operator, you will actually change one of the
strings that came into the operator, and, while it shouldn't be noticeable
by the end user, it can cause problems.

Instead, C<bytes_to_utf8> will give you a UTF8-encoded B<copy> of its
string argument. This is useful for having the data available for
comparisons and so on, without harming the original SV. There's also
C<utf8_to_bytes> to go the other way, but naturally, this will fail if
the string contains any characters above 255 that can't be represented
in a single byte.
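
To show the idea of a non-destructive upgrade, and of a downgrade that
can fail, here is a plain C sketch. These are hypothetical stand-ins
with made-up names and signatures, not the Perl functions themselves,
and they use C<malloc> rather than Perl's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd, NUL-terminated UTF8 copy of
 * a byte string and stores its length in *outlen. The input is not
 * modified. Returns NULL if allocation fails. */
unsigned char *sketch_bytes_to_utf8(const unsigned char *s, size_t len,
                                    size_t *outlen)
{
    unsigned char *out, *d;
    size_t i;

    assert(outlen != NULL && (s != NULL || len == 0));
    out = (unsigned char *)malloc(2 * len + 1);  /* worst case doubles */
    if (out == NULL)
        return NULL;
    d = out;
    for (i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];                          /* unchanged */
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    *d = '\0';
    *outlen = (size_t)(d - out);
    return out;
}

/* Hypothetical helper: the reverse direction. out must have room for
 * len bytes. Returns 1 on success, or 0 if the input is malformed or
 * contains a character above 255. */
int sketch_utf8_to_bytes(const unsigned char *s, size_t len,
                         unsigned char *out, size_t *outlen)
{
    size_t i = 0, n = 0;

    assert(out != NULL && outlen != NULL && (s != NULL || len == 0));
    while (i < len) {
        if (s[i] < 0x80) {
            out[n++] = s[i++];
        } else if ((s[i] == 0xC2 || s[i] == 0xC3)
                   && i + 1 < len && (s[i + 1] & 0xC0) == 0x80) {
            /* lead bytes C2 and C3 cover exactly characters 128..255 */
            out[n++] = (unsigned char)(((s[i] & 0x03) << 6)
                                       | (s[i + 1] & 0x3F));
            i += 2;
        } else {
            return 0;
        }
    }
    *outlen = n;
    return 1;
}
```

The caller owns the returned copy and must C<free> it; the point of the
design is that the original buffer stays exactly as it came in.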

=head2 Is there anything else I need to know?

Not really. Just remember these things:

=over 3

=item *

There's no way to tell if a string is UTF8 or not. You can tell if an SV
is UTF8 by looking at its C<SvUTF8> flag. Don't forget to set the flag if
something should be UTF8. Treat the flag as part of the PV, even though
it's not - if you pass on the PV to somewhere, pass on the flag too.

=item *

If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value,
unless C<!(*s & 0x80)> in which case you can use C<*s>.

=item *

When writing to a UTF8 string, B<always> use C<uv_to_utf8>, unless 
C<uv < 0x80> in which case you can use C<*s = uv>.

=item *

Mixing UTF8 and non-UTF8 strings is tricky. Use C<bytes_to_utf8> to get
a new string which is UTF8 encoded. There are tricks you can use to
delay deciding whether you need to use a UTF8 string until you get to a
high character - C<HALF_UPGRADE> is one of those.

=back

And that's it.

-- 
Familiarity breeds facility.
        -- Megahal (trained on asr), 1998-11-06

