
Re: program to look at char class complements [perl #60156]



Tom Christiansen wrote:
> In-Reply-To: Message from karl williamson <public@khwilliamson.com> 
>    of "Wed, 29 Oct 2008 10:21:33 MDT." <49088D8D.3050202@khwilliamson.com> 
> 
>> Tom Christiansen wrote:
> 
>>> PRESCRIPT: Karl, my program at the end below should be good for 
>>>            sniffing out \p and \P mutual-exclusion failure bugs 
>>>            such as I believe you recently reported.
>>>
> 
>> Thanks for the program.  My bug reports have mostly come from
>> running something similar, but not with nearly as many of the
>> classes as you did.
> 
>> I'm guessing you didn't run it past 127, because it fails immediately 
>> with a Malformed UTF-8 character (fatal) at (eval 3) line 2.
> 
> Yes, that's because it intentionally assumed ASCII only.  The amended
> program and results follow.  You should now be able to use this program
> directly for your testing.
> 
>> I actually don't think there are any issues with the \p and \P,
>> because those go out and use auto-constructed files.  The
>> problems I have found have been in the posix classes and
>> entirely in the 128-255 range.
> 
> It's not just there.  It's a bug in negating of POSIX char classes.
> I can elicit errors at codepoints <128, even with Unicode semantics on.
> 
> Specifically, just running POSIX charclass tests alone:
> 
>     REPORT: ranging from U+00 (0) .. U+00007F (127) [128 codepoints]
>     failed 28 tests, 3300 out of 3328 tests successful (0.991587%)
> 
> The problems are with the [:print:] and [:punct:] properties:
> 
>     Trouble w/U+007E: TILDE: Property "[:punct:]" failed
>     Trouble w/U+007C: VERTICAL LINE: Property "[:punct:]" failed
>     Trouble w/U+0060: GRAVE ACCENT: Property "[:punct:]" failed
>     Trouble w/U+005E: CIRCUMFLEX ACCENT: Property "[:punct:]" failed
>     Trouble w/U+003E: GREATER-THAN SIGN: Property "[:punct:]" failed
>     Trouble w/U+003D: EQUALS SIGN: Property "[:punct:]" failed
>     Trouble w/U+003C: LESS-THAN SIGN: Property "[:punct:]" failed
>     Trouble w/U+002B: PLUS SIGN: Property "[:punct:]" failed
>     Trouble w/U+0024: DOLLAR SIGN: Property "[:punct:]" failed
>     Trouble w/U+000D: CARRIAGE RETURN (CR): Property "[:print:]" failed
>     Trouble w/U+000C: FORM FEED (FF): Property "[:print:]" failed
>     Trouble w/U+000B: LINE TABULATION: Property "[:print:]" failed
>     Trouble w/U+000A: LINE FEED (LF): Property "[:print:]" failed
>     Trouble w/U+0009: CHARACTER TABULATION: Property "[:print:]" failed
> 
> The positive char class test is like this:
> 
>   (
>     ( $U_char =~ /\A[[:print:]]\z/ )
>               ==
>     ( $U_char !~ /\A[[:^print:]]\z/ )
>   )
>               &&
>   (
>     ( $U_char =~ /\A[[:print:]]\z/ )
>               ==
>     ( $U_char !~ /\A[^[:print:]]\z/ )
>   )
> 
> The negative char class test is this:
> 
>     ( $U_char =~ /\A[[:^punct:]]\z/ )
>               ==
>     ( $U_char =~ /\A[^[:punct:]]\z/ )
> 
> ...

I wondered why you were getting errors when my own test cases (which I 
thought were rather extensive) were getting none in the ASCII range.  It 
turns out that it's because I hadn't thought to try the negation ^ in 
the outer set of brackets.  Also, these problems don't occur when the 
characters aren't utf8 (when they are packed as C instead of U).  And 
they fail in exactly the places where the posix classes don't match the 
unicode ones.  It is documented that [:punct:] includes precisely the 
set of symbols that have failures, and that \p{IsPunct} does not.  
Similarly for [:print:] and the ones it has failures for.
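
To make the comparison concrete, here is a minimal sketch of the kind of 
check discussed above.  It is not Tom's actual program; the sub name 
complement_failures and its arguments are made up for illustration.  It 
tests the inner negation [[:^class:]] and the outer negation [^[:class:]] 
against the positive class, once with the characters packed as U and once 
as C.  What it reports depends on the perl it is run under.

```perl
use strict;
use warnings;

# Hypothetical helper: return the codepoints in $lo..$hi for which the
# positive class and its two negated spellings disagree.  $template is a
# pack() template: "U" gives a utf8-flagged string, "C" a byte string.
sub complement_failures {
    my ($class, $template, $lo, $hi) = @_;

    # Build the three patterns as strings first, then compile them once.
    my ($pos, $inner_neg, $outer_neg) = map { qr/$_/ } (
        "\\A[[:${class}:]]\\z",     # in the class
        "\\A[[:^${class}:]]\\z",    # negation inside the POSIX class
        "\\A[^[:${class}:]]\\z",    # negation in the outer brackets
    );

    my @failed;
    for my $cp ($lo .. $hi) {
        my $char = pack($template, $cp);
        my $in   = $char =~ $pos       ? 1 : 0;
        my $not1 = $char =~ $inner_neg ? 1 : 0;
        my $not2 = $char =~ $outer_neg ? 1 : 0;

        # A character is either in the class or in its complement, never
        # both, and the two complement spellings must agree.
        push @failed, $cp unless $in != $not1 && $not1 == $not2;
    }
    return @failed;
}

for my $class (qw(print punct)) {
    for my $template (qw(U C)) {
        my @failed = complement_failures($class, $template, 0, 127);
        my $where  = @failed
            ? " at " . join(" ", map { sprintf "U+%04X", $_ } @failed)
            : "";
        printf "[:%s:] packed as %s: %d disagreement(s)%s\n",
            $class, $template, scalar @failed, $where;
    }
}
```

Comparing the U and C runs side by side is the quickest way to see 
whether a failure depends on the utf8 flag.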

I haven't looked at the code, but what is likely going on is that 
the complement of a complement loses these differences from unicode 
(only when the utf8 flag is on, because otherwise the unicode classes 
aren't looked at).
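
If that guess is right, the failing codepoints should be exactly those 
where a posix class and the corresponding unicode property disagree.  A 
rough sketch for listing them follows.  The sub name class_differences is 
hypothetical, and I am assuming \p{Punct} is the property to compare 
[:punct:] against; this only shows where the two definitions differ, not 
what the regex engine does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: list the codepoints in $lo..$hi where the POSIX
# bracket class [:$posix:] and the Unicode property \p{$prop} disagree.
sub class_differences {
    my ($posix, $prop, $lo, $hi) = @_;

    # Build the patterns as strings, then compile them once.
    my $posix_src = "\\A[[:${posix}:]]\\z";
    my $prop_src  = "\\A\\p{${prop}}\\z";
    my $posix_re  = qr/$posix_src/;
    my $prop_re   = qr/$prop_src/;

    my @diff;
    for my $cp ($lo .. $hi) {
        my $char     = pack("U", $cp);    # utf8-flagged, as in the failing tests
        my $in_posix = $char =~ $posix_re ? 1 : 0;
        my $in_prop  = $char =~ $prop_re  ? 1 : 0;
        push @diff, $cp if $in_posix != $in_prop;
    }
    return @diff;
}

# Print each codepoint where [:punct:] and \p{Punct} disagree in ASCII.
printf "U+%04X\n", $_ for class_differences('punct', 'Punct', 0, 127);
```

The list this prints can be checked against the "Trouble w/" lines in 
Tom's report above.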

My goal is to fix all these problems in the 128-255 range.  It's turning 
out to be more work than I thought, essentially because there are a lot 
of pre-existing errors and inconsistencies.


References to:
Juerd Waalboer <juerd@convolution.nl>
karl williamson <public@khwilliamson.com>
Tom Christiansen <tchrist@perl.com>
