Re: [perl #58182] Unicode problem
My proposal from a week and a half ago hasn't spawned much
dissension--yet. I'll take that as a good sign, and proceed.
Here's a hodge-podge of my thoughts about it, but most important, I am
concerned about the enabling and disabling of this. I think there has
to be some way to disable it in case current code has come to rely on
what I call broken behavior.
It looks like in 5.12, Rafael wants the new mode to be default behavior.
But he also said that a switch could be added in 5.10.x to turn it on,
as long as performance doesn't suffer.
Glenn, "use bytes" doesn't necessarily mean binary. For example,
use bytes;
print lc('A'), "\n";
prints 'a'. It does mean ASCII semantics even for utf8::upgraded strings.
If there is a way to en/dis-able this mode, doesn't that have to be a
pragma? Doesn't it have to be lexically scoped? And if the answers to
these are yes, what do we do with things that are created under one mode
and then executed in the other?
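For code (as opposed to data), a lexically scoped pragma answers the "created under one mode, executed in the other" question by compile-time scope: an operation keeps the semantics in force where it was compiled, not where it is called from. A minimal sketch, assuming the `unicode_strings` feature of Perl 5.12 and later as a stand-in for the switch discussed here (the thread had not settled on a name or mechanism); the sub names are hypothetical:

```perl
use strict;
use warnings;
# Needs Perl 5.12 or later for feature 'unicode_strings'.

# Compiled with the feature off: uc() uses ASCII-only rules on 8-bit strings.
sub upper_plain { return uc $_[0] }

# Compiled with the feature on: uc() uses Unicode rules.
sub upper_unicode {
    use feature 'unicode_strings';
    return uc $_[0];
}

my $e_acute = "\xE9";    # U+00E9, stored as one byte, UTF8 flag off

my ($from_plain, $from_unicode);
{
    use feature 'unicode_strings';
    # Called from an "on" scope, but the uc() inside was compiled "off".
    $from_plain = upper_plain($e_acute);
}
# Called from an "off" scope, but the uc() inside was compiled "on".
$from_unicode = upper_unicode($e_acute);

printf "upper_plain:   U+%04X\n", ord $from_plain;
printf "upper_unicode: U+%04X\n", ord $from_unicode;
```

The caller's scope does not leak into either sub, so the open question is really about data and compiled patterns that cross scopes, not about ordinary code.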
Juerd wrote:
====
Pragmas have problems, especially in regular expressions. And it's very
hard to load a pragma conditionally, which makes writing version
portable code hard. Besides that, any pragma affecting regex matches
needs to be carried in qr//, which in this case means new regex flags to
indicate the behavior for (?i:...). According to dmq, adding flags is
hard.
====
I don't understand what you mean that pragmas have problems, esp in
re's. Please explain.
I had thought I had this solved for qr//i. The way I was planning to
implement this for pattern matching is quite simple. First, by changing
the existing fold table definitions to include the Unicode semantics,
the pattern matching magically starts working without any code logic
changes for all but two characters: the German sharp s (U+00DF) and the micro
sign (U+00B5). For these, I was planning to use the existing mechanisms to
compile the re as utf8, so it wouldn't require any new flags. Thus qr//
would be utf8 if it contained these two characters. And it works today
to match such a pattern against both non-utf8 and utf8 strings. I
haven't tested to see what happens when such a pattern is executed under
use bytes. I was presuming it did something reasonable. But now I'm
not so sure, as I've found a number of bugs in the re code in my
testing, and some are of a nature that I don't feel comfortable with my
level of knowledge about how it works to dive in and fix them. They
should be fixed anyway, and I'm hoping some expert will undertake that.
I think that once they're fixed, I could extend them to work in
the latin1 range quite easily. So the bottom line is that qr//i may or
may not be a problem.
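To make the qr//i concern concrete, here is a small sketch of the behavior at issue: two strings that compare equal can give different case-insensitive match results depending only on their internal encoding. It assumes a perl where neither `use feature 'unicode_strings'` nor a `/u` or `/a` modifier is in effect; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Two strings holding the same single character, U+00E9 (e with acute).
my $native   = "\xE9";       # stored as one byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, UTF8 flag on

# The two strings compare equal as far as Perl code can tell.
my $same = $native eq $upgraded ? 1 : 0;

# Case-insensitive match against U+00C9 (E with acute).
my $native_hit   = $native   =~ /\xC9/i ? 1 : 0;
my $upgraded_hit = $upgraded =~ /\xC9/i ? 1 : 0;

print "strings equal:    $same\n";
print "native matches:   $native_hit\n";
print "upgraded matches: $upgraded_hit\n";
```

Under those assumptions the native string should fail to match while the upgraded one matches, which is the inconsistency the fold-table change is meant to remove.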
For the other interactions, I'm not sure there is a problem. If one
creates a string whether or not this mechanism is on, it remains 8 bits,
unless it has a code point above 255. If one operates on it while this
mechanism is on, it gets Unicode semantics, which in a few cases
irretrievably convert it to utf8 because the result is above 255. If
one operates on it while this mechanism is off, one gets ASCII semantics.
I don't really see a problem with that.
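One of the "few cases" where the result lands above 255 is uppercasing U+00FF. A minimal sketch, again using the `unicode_strings` feature of Perl 5.12 and later as an assumed stand-in for the mechanism being proposed:

```perl
use strict;
use warnings;
# Needs Perl 5.12 or later for feature 'unicode_strings'.

my $y = "\xFF";    # U+00FF, y with diaeresis; 8-bit string, UTF8 flag off

# Mechanism off: ASCII semantics, so the character is left alone.
my $off = uc $y;

# Mechanism on: Unicode semantics, so the uppercase form U+0178 is produced.
my $on;
{
    use feature 'unicode_strings';
    $on = uc $y;
}

printf "off: U+%04X, utf8 flag %d\n", ord $off, utf8::is_utf8($off) ? 1 : 0;
printf "on:  U+%04X, utf8 flag %d\n", ord $on,  utf8::is_utf8($on)  ? 1 : 0;
```

The source string is the same in both cases; only the result of the "on" operation has to be stored as utf8, because U+0178 does not fit in 8 bits.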
I think it would be easy to extend this to EBCDIC, at least the three
encodings perl has compiled-in tables for. The problem is that Rafael
said that there's no one testing on EBCDIC machines, so I couldn't know
if it worked or not before releasing it.
I'm also thinking that the Windows file name problems can be considered
independent of this, and addressed at a later time.
I also agree with Glenn's and Juerd's wording changes.
I saw nothing in my reading of the code that would lead me to touch the
utf8 flag's meaning. But I am finding weird bugs in which Perl
apparently gets mixed up about the flag. These vanish if I rearrange
the order of supposedly independent lines in the program. It looks like
it could be a wild write. I wrote a bug report [perl #59378], but I
think that the description of that is wrong.
So the bottom line for now is that I'd like to get some consensus about how
to turn it on and off, and whether to at all. I think there has to be a
way to turn it off. I guess I would claim that in 5.12,
"use bytes" could be used to turn it off. But that may be
controversial, and doesn't address backporting it.
- Follow-Ups from:
-
Glenn Linderman <perl@NevCal.com>