Volunteer for fixing [perl #58182], the "Unicode" bug
I'm the person who submitted this bug report. I think this bug should
be fixed in Perl 5, and I'm volunteering to do it. Towards that end, I
downloaded the Perl 5.10 source and hacked up an experimental version
that seems to fix it. And now I've joined this list to see how to
proceed. I don't know the protocol involved, so I'll just jump in, and
hopefully that will be all right.
To refresh your memory, the current implementation of perl on non-EBCDIC
machines is problematic for characters in the range 128-255 when no
locale is set.
The slides from the talk "Working around *the* Unicode bug" during
YAPC::Europe 2007 in Vienna:
http://juerd.nl/files/slides/2007yapceu/unicodesemantics.html
give more cases of problems than were in my bug report.
The crux of the problem is that on non-EBCDIC machines, in the absence
of locale, a character (or code point) in that range has meaningful
semantics only if it is stored in utf8. The exceptions in pattern
matching are \h, \H, \v, \V and the \p{} patterns, which behave the same
regardless of storage. (This leads to an anomaly with the no-break
space, which is considered to be horizontal space (\h) but not space
(\s).) (The characters also always have the base semantics of having an
ordinal number, and of not being anything in particular, meaning that
they all match \W, \D, \S, [[:^punct:]], etc.)
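A minimal sketch of that storage dependence, assuming no locale and no
feature that forces Unicode string semantics is in effect. The helper
name class_matches is made up for illustration:

```perl
use strict;
use warnings;

# Report which character classes a one-character string matches.
sub class_matches {
    my ($char) = @_;
    return {
        w => ($char =~ /\w/ ? 1 : 0),
        s => ($char =~ /\s/ ? 1 : 0),
        h => ($char =~ /\h/ ? 1 : 0),
    };
}

# Same code point (e-acute), two internal representations.
my $e_native   = "\xE9";
my $e_upgraded = "\xE9";
utf8::upgrade($e_upgraded);

# Same for the no-break space.
my $nbsp_native   = "\xA0";
my $nbsp_upgraded = "\xA0";
utf8::upgrade($nbsp_upgraded);

for my $case (
    [ 'e-acute, native'   => $e_native ],
    [ 'e-acute, utf8'     => $e_upgraded ],
    [ 'no-break sp, native' => $nbsp_native ],
    [ 'no-break sp, utf8'   => $nbsp_upgraded ],
) {
    my ($label, $char) = @$case;
    my $m = class_matches($char);
    printf "%-20s \\w=%d \\s=%d \\h=%d\n", $label, $m->{w}, $m->{s}, $m->{h};
}
```

The two e-acute strings compare equal with eq, yet only the upgraded
one matches \w. The no-break space matches \h either way, but matches
\s only when upgraded.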
Perl stores characters as utf8 automatically if a string contains any
code points above 255, and for ASCII code points the storage makes no
difference, since they behave the same either way.
That leaves a hole-in-the-doughnut of characters between 128 and 255
with behavior that varies depending on whether they are stored as utf8
or not. This is contrary, for example, to the Camel book: "character
semantics are preserved at an abstract level regardless of
representation" (p.403). (How they get stored depends on how they were
input, or whether or not they are part of a longer string containing
code points larger than 255, or if they have been explicitly set by
using utf8::upgrade or utf8::downgrade.)
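To see which way a given string is stored, one can ask utf8::is_utf8.
The following is only a sketch; the helper name storage and the sample
strings are made up for illustration:

```perl
use strict;
use warnings;

# Describe the internal representation of a string.
sub storage { return utf8::is_utf8($_[0]) ? 'utf8' : 'bytes' }

my $latin1 = "caf\xE9";            # all code points below 256: stays native
my $wide   = "caf\xE9 \x{263A}";   # has a code point above 255: stored as utf8

my $forced = "caf\xE9";
utf8::upgrade($forced);            # same characters, now stored as utf8

my $back = $forced;
utf8::downgrade($back);            # and back to the native representation

printf "%-8s %s\n", $_->[0], storage($_->[1])
    for [ latin1 => $latin1 ], [ wide => $wide ],
        [ forced => $forced ], [ back => $back ];
```

Note that $latin1 and $forced compare equal with eq even though they
are stored differently, which is exactly why the difference in behavior
is surprising.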
I know of three areas where this leads to problems.
The first is the pattern matching already alluded to. This is at least
documented (though somewhat confusingly). And one can use the \p{}
constructs to avoid the issue.
The second is case changing functions, like lcfirst() or \U in pattern
substitutions.
And the third is ignoring case in pattern matches.
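The second and third areas can be sketched in a few lines, again
assuming no locale and no feature that forces Unicode string semantics.
The variable and function names are only illustrative:

```perl
use strict;
use warnings;

# e-acute (0xE9) in both representations; its upper case is 0xC9.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# True if the string matches capital E-acute when case is ignored.
sub matches_upper_ignoring_case {
    my ($str) = @_;
    return $str =~ /\xC9/i ? 1 : 0;
}

# Case changing: only the utf8-stored copy is actually upper-cased.
printf "uc native   -> 0x%02X\n", ord uc $native;
printf "uc upgraded -> 0x%02X\n", ord uc $upgraded;

# Case-insensitive matching: same split.
printf "native   =~ /\\xC9/i : %d\n", matches_upper_ignoring_case($native);
printf "upgraded =~ /\\xC9/i : %d\n", matches_upper_ignoring_case($upgraded);
```

With the native string, uc() leaves the character alone and the /i
match fails; with the upgraded string both work as Unicode says they
should.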
There may be others which I haven't looked for yet. I think, for
example, that quotemeta() will escape all these characters, though I
don't believe that this causes a real problem.
One response I got to my bug report was that a lot of code depends on
things working the way they currently do. I'm wondering if that applies
to all three of the areas, or just the first?
Also, from reading the perl source, it appears to me that EBCDIC
machines may work differently (and more correctly to my way of thinking)
than ASCII-ish ones.
An idea I've had is to add a pragma like "use latin1", or maybe "use
locale unicode", or something else as a way of not breaking existing
application code.
Anyway, I'm hoping to get some sort of fix in for this. In my
experimental implementation (which currently doesn't change EBCDIC
handling), it is mostly just extending the existing definitions of ASCII
semantics to include the 128..255 Latin-1 range. Code logic changes were
required only in the uc and ucfirst functions (to accommodate 3
characters which require special handling), and in the regular
expression compilation (to accommodate 2 characters which need special
handling). Obviously, in my ignorance, I may be missing things that
others can enlighten me on.
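The post does not name the three characters. One plausible reading,
which is an assumption here, is that they are the Latin-1 characters
whose Unicode upper case does not fit in a single 0..255 code point:
sharp s (0xDF), micro sign (0xB5) and y with diaeresis (0xFF). The
sketch below shows what uc does to them once they are stored as utf8;
the helper name unicode_uc is made up for illustration:

```perl
use strict;
use warnings;

# Upper-case a string under Unicode rules by forcing utf8 storage first.
sub unicode_uc {
    my ($str) = @_;
    utf8::upgrade($str);
    return uc $str;
}

for my $char ("\xDF", "\xB5", "\xFF") {
    my $upper = unicode_uc($char);
    printf "0x%02X -> %s\n", ord($char),
        join ' ', map { sprintf 'U+%04X', ord } split //, $upper;
}
```

The sharp s becomes the two characters "SS", and the other two map to
code points above 255, so none of the three can be handled by a simple
one-byte-to-one-byte table.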
So I'd like to know how to proceed.
Karl Williamson