
Re: [perl #58430] perlbug AutoReply: Unicode::UCD::casefold() doesnot work as documented, nor prob as intended



I have looked into the problem some more, and now better understand how 
the file is organized.  It isn't very well documented by the Unicode 
Consortium, nor by this module, which essentially parrots the 
documentation in the Unicode .txt file.

The file maps each code point to a "folded" code point, so that code 
points that differ only in case supposedly map to the same folded one 
(possibly after renormalizing, but I'm not sure about that).  In most 
cases the folded form is the same as the lower case version of the code 
point, but not always.
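
To make the difference between folding and lower-casing concrete, here is a small illustration.  It is in Python rather than Perl, and it uses Python's built-in str.casefold() (which applies Unicode full case folding) instead of anything from Unicode::UCD, so take it as a sketch of the concept only.

```python
# Two spellings of the same German word that differ only in case.
word_a = "Straße"
word_b = "STRASSE"

# Lower-casing leaves the sharp s alone, so the two forms still differ.
lower_equal = word_a.lower() == word_b.lower()

# Case folding maps the sharp s to "ss", so both forms fold identically.
fold_equal = word_a.casefold() == word_b.casefold()

print(lower_equal, fold_equal)
```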

The authors of the .txt file were concerned that applications would not 
be able to handle mappings that return a string of more than one code 
point, so they marked all these as type F in the file.  In some cases, 
there is an alternative mapping that, though not as good, is better than 
nothing.  For these cases, there is a second entry in the file for the 
code point, and it is marked as type S.  Thus the F mappings are better 
than the S mappings if the application is able to handle the length 
change, but the S mappings are better than nothing.  There is never an S 
mapping without an F one as well.  If a single codepoint translation is 
fully acceptable, it has type C.  (casefold() currently always returns 
the S mapping, if it exists, simply because this version of the file 
always has that mapping placed in the file after the F mapping, and the 
function silently returns the last mapping in the file for a given code 
point.)
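
The "last mapping wins" behavior can be sketched as follows.  This is Python, not the module's actual Perl code; the sample lines are a few entries written in the file's format, and parse() and last_entry_wins() are hypothetical names I made up for the illustration.

```python
# A few entries in the CaseFolding.txt format: code; status; mapping; # name
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
1F88; F; 1F00 03B9; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
"""

def parse(text):
    """Yield (code, status, mapping) tuples, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        yield code, status, mapping

def last_entry_wins(text):
    """Keep one entry per code point; a later line overwrites an earlier one."""
    table = {}
    for code, status, mapping in parse(text):
        table[code] = (status, mapping)
    return table

table = last_entry_wins(SAMPLE)
print(table["1F88"])  # the S entry, only because it follows the F entry
```

If the two 1F88 lines were swapped in the file, the same code would return the F mapping instead, which is why the current result is an accident of file order.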

To restate: the length-changing mappings are denoted in the file by F, 
and if there is a folding that is better than nothing for the same 
character, there will be a second entry for it marked as S.  There is 
never an S without an F, but there can be F entries without S ones.

There are currently 2 mappings marked as T, for Turkish only.

If the application can handle length changes, I claim it will get the 
best results by using the C and F mappings, and, if in a Turkish locale, T.

But this isn't what the function returns.   It does not present an 
adequate interface to the data.

One way to fix it would be to return everything, including a hash of 
hashes for codepoints that have more than one possible folding, like the 
casespec() function.
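
A rough sketch of that "return everything" shape, in Python rather than Perl: each code point maps to a dictionary keyed by status.  The function name all_foldings() and the sample lines are assumptions for illustration, not the module's interface.

```python
from collections import defaultdict

# A few entries in the CaseFolding.txt format: code; status; mapping; # name
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
1F88; F; 1F00 03B9; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
"""

def all_foldings(text):
    """Return {code: {status: mapping}} so that no entry is discarded."""
    table = defaultdict(dict)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        table[code][status] = mapping
    return dict(table)

foldings = all_foldings(SAMPLE)
print(foldings["1F88"])  # both the F and the S mapping are available
print(foldings["0049"])  # both the C and the T mapping are available
```

The caller then has to know what C, F, S and T mean in order to choose among them.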

Another option would be to add two extra boolean parameters, one to 
denote whether or not one wanted the Turkish mappings, and one to denote 
whether or not one was willing to get a length-changing folded string 
returned.

This latter implementation would make it clearer to the application 
writer what the implications of those choices are.
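
Here is a minimal sketch of what the two-parameter interface might look like.  It is Python, not a patch to Unicode::UCD, and the parameter names turkish and full, the table layout, and the rule of falling back to the code point itself are all my assumptions.

```python
# Hypothetical table: {code: {status: mapping}}, in the file's hex notation.
TABLE = {
    "0049": {"C": "0069", "T": "0131"},
    "00DF": {"F": "0073 0073"},
    "0130": {"F": "0069 0307", "T": "0069"},
    "1F88": {"F": "1F00 03B9", "S": "1F80"},
}

def casefold(code, table=TABLE, turkish=False, full=True):
    """Pick one folding for a code point.

    turkish: prefer the T mapping when one exists.
    full:    accept length-changing F mappings; otherwise use S.
    A code point with no applicable entry folds to itself.
    """
    entries = table.get(code, {})
    if turkish and "T" in entries:
        return entries["T"]
    if full and "F" in entries:
        return entries["F"]
    if not full and "S" in entries:
        return entries["S"]
    return entries.get("C", code)

print(casefold("1F88"))                # F mapping, two code points
print(casefold("1F88", full=False))    # S mapping, one code point
print(casefold("0049", turkish=True))  # T mapping
```

With both parameters left at their defaults this gives the C and F mappings, which is the combination I argued gives the best results.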

Karl Williamson


Follow-Ups from:
David Landgren <david@landgren.net>
