[RFC] Regular expression character classes and unicode.
As has been discussed recently and over the years we have some issues
with the regular expression engine and unicode. This mail is an
attempt to outline a feasible solution to these problems.
The problem can be reduced to two issues.
First, \d \s \w and the POSIX character classes such as [:upper:] have
differing meanings under unicode and otherwise.
Second, use of any of these inside of a "composite" character class is
currently seriously broken, such that a character class AND its
complement may either both match, or both fail to match, a large number
of characters. This is completely broken.
There are two paths to a solution to this:
1) Completely rewrite how character classes are handled.
2) Change it so that the definition of these patterns is consistent
regardless of how the string is encoded internally.
Personally I am not at all in favour of #1, both because it is a
considerable amount of effort, and because I think the result is
undesirable. Since perl is allowed to "upgrade" a string under a wide
range of circumstances it means that these character classes are
unpredictable to the point of being generally unwise to use unless you
have complete control over the strings that they are matching. What
appears to be DWIM is not at all.
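To make the unpredictability concrete, here is a minimal sketch. It
assumes the current (legacy) semantics are in effect, with no locale,
feature or other pragma changing them; is_word is just a helper name
invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains a \w character, else 0.
sub is_word { return $_[0] =~ /\w/ ? 1 : 0 }

my $char = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE, stored as one byte

my $before = is_word($char);
utf8::upgrade($char);    # same character, different internal encoding
my $after = is_word($char);

print "before upgrade: $before, after upgrade: $after\n";
```

Under those assumptions the two results differ, even though the string
holds the same single character both times. Only its internal
representation changed.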
Option #2 is my preference, both because we already have a
solution in bleadperl (compile time disabled, but present), and
because it is a solution that we can build on to allow the programmer
to make the decision as to which semantics to use. Therefore I
will only discuss this option here.
I propose the following:
1. Make the "non-unicode" behaviour of these patterns be the default.
This will break code, especially code doing a lot of unicode text
processing, but will be manageable. In particular the POSIX character
classes will be defined EXACTLY as specified in the POSIX standard.
Those that wish to use Unicode behaviour will either have to use the
\p{} syntax, or have to use a pragma to change the behaviour.
2. Add a new special escape shortcut to mean "unicode word character",
this would have the same semantics as \w does now on a unicode string,
regardless of the internal representation of the string being matched
or the pattern being matched against. I have no idea what this
"unicode word character" should be. In another universe \u \U would be
perfect candidates IMO, but in our universe, well, \U is taken. So
what this is called is an open question.
3. Make it possible to lexically define what the meaning would be for
these classes, again regardless of the string encoding. Having
inconsistent behavior would be up to the user, but strongly
recommended against as this would mean that character class
complements would break.
So assuming we do this we have two open questions: A) what to call the
unicode version of \w, and B) what interface to use for allowing
people to choose the behaviour they want.
In order to answer B it is worth understanding how perl does character classes.
First, it has a bitmap for codepoints in the range 0-255. Additionally,
under "use locale", a bitmap for each "special" character class
definition is used. Lastly, there is a list of unicode property names
that the character class can match. It is important to understand that
absent "use locale", at regex compile time a bitmap is constructed of
all the codepoints in the 0-255 range. This basically means that if
users are going to override this behavior outside of defaults provided
by the core they are going to have to supply a tuple of three values:
CLASS BITMAP PROPERTYNAME
and be responsible themselves for whether BITMAP and PROPERTYNAME mean
the same sets of characters. If they don't then char class complements
will break.
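As a rough illustration of why BITMAP and PROPERTYNAME have to agree,
here is a toy model of such a tuple. It is a sketch only: make_class
and class_matches are invented names, the hash layout is not the
engine's real data structure, and the rule "byte strings consult the
bitmap, UTF-8 strings consult the property" is a simplifying
assumption.

```perl
use strict;
use warnings;

# Hypothetical model of one class definition: a 256-bit bitmap for
# codepoints 0-255 plus a pattern standing in for the property name.
sub make_class {
    my ($low_re, $property_re) = @_;
    my $bitmap = "\0" x 32;
    for my $cp (0 .. 255) {
        vec($bitmap, $cp, 1) = 1 if chr($cp) =~ $low_re;
    }
    return { bitmap => $bitmap, property => $property_re };
}

# Assumption of this sketch: a byte-encoded string is answered from the
# bitmap, a UTF-8 encoded string is answered from the property.
sub class_matches {
    my ($class, $cp, $is_utf8) = @_;
    return vec($class->{bitmap}, $cp, 1) ? 1 : 0 if !$is_utf8 && $cp < 256;
    return chr($cp) =~ $class->{property} ? 1 : 0;
}

my $ascii_word   = qr/[A-Za-z0-9_]/;
my $unicode_word = qr/\p{Word}/;

# Bitmap and property describe the same set of characters.
my $consistent = make_class($unicode_word, $unicode_word);
# Bitmap is ASCII only while the property is unicode.
my $inconsistent = make_class($ascii_word, $unicode_word);

for my $class ($consistent, $inconsistent) {
    my $as_bytes = class_matches($class, 0xE9, 0);
    my $as_utf8  = class_matches($class, 0xE9, 1);
    print "U+00E9 as bytes: $as_bytes, as UTF-8: $as_utf8\n";
}
```

With the consistent pair U+00E9 gets the same answer either way. With
the inconsistent pair the answer depends on the encoding, which is
exactly the situation in which a class and its complement can disagree
about the same character.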
So I'm thinking of something like this:
use re 'legacy_charclass_semantics';
use re 'broken_charclass_semantics'; # same as above
would enable the current broken behaviour. The default behavior would
be equivalent to saying:
use re 'standard_charclass_semantics';
which would make \w match [A-Za-z0-9_], and make POSIX charclasses
behave as the POSIX standard dictates.
Those that want unicode semantics would say:
use re 'unicode_charclass_semantics';
Possibly the user would also be able to do something like this:
use re charclass '\s' => [ $bitmap, $propertyname ], ..... ;
This would allow a user to override the behaviour as they choose.
There is one serious downside to the pragma idea. What should happen
if you embed a pattern compiled under one set of charclass semantics
into a pattern with a different set of semantics? The only way to
tackle this currently is to use regex modifiers, but this then leads
to problems: for instance, how does one embed this information
into the pattern itself without having really insane syntax? This is
actually a fairly strong argument against the "user defined" behavior,
and for restricting the possibilities to three: legacy (broken),
standard, and unicode. As far as I understand this is more or less
what PHP has done.
So assuming that we have to have regex modifiers, what would they be?
/u /a /l perhaps? (unicode, ascii, legacy)
There is precedent for the /u in
http://de2.php.net/manual/en/reference.pcre.pattern.modifiers.php.
So maybe it should be /u for unicode and /U (default) for non-unicode
and /L for legacy.
Cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"