Re: [perl #58430] perlbug AutoReply: Unicode::UCD::casefold() doesnot work as documented, nor prob as intended
I have looked into the problem some more, and now understand better how
the file is organized. It isn't very well documented by the Unicode
Consortium, nor by this module, which essentially parrots the
documentation in the Unicode .txt file.
The file contains a mapping from each code point to a "folded" code point,
so that code points that differ only in case are supposed to map to the
same folded one (possibly requiring renormalization, but I'm not sure about
that). Usually the folded form is the same as the lower case version of
the code point, but not always.
The authors of the .txt file were concerned that applications would not
be able to handle mappings that return a string of more than one code
point, so they marked all these as type F in the file. In some cases,
there is an alternative mapping that, though not as good, is better than
nothing. For these cases, there is a second entry in the file for the
code point, and it is marked as type S. Thus the F mappings are better
than the S mappings if the application is able to handle the length
change, but the S mappings are better than nothing. There is never an S
mapping without an F one as well. If a single codepoint translation is
fully acceptable, it has type C. (casefold() currently always returns
the S mapping, if it exists, simply because this version of the file
always has that mapping placed in the file after the F mapping, and the
function silently returns the last mapping in the file for a given code
point.)
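
To make the "last mapping wins" behaviour concrete, here is a minimal sketch. It is written in Python rather than Perl purely for brevity, and it is not the module's actual code: the function names are hypothetical, and the three sample lines are copied by hand from the Unicode file. It only shows how a table keyed on code point loses the F entry when an S entry for the same code point comes later.

```python
# Three entries in the "code; status; mapping; # name" layout of the file.
SAMPLE = """\
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def parse_line(line):
    """Split one data line into (code, status, list of folded code points)."""
    data = line.split("#", 1)[0]              # drop the trailing comment
    code, status, mapping = [f.strip() for f in data.split(";")[:3]]
    return code, status, mapping.split()

def last_entry_wins(text):
    """Hypothetical loader that keeps one entry per code point."""
    table = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        code, status, mapping = parse_line(line)
        table[code] = (status, mapping)       # overwrites any earlier entry
    return table

table = last_entry_wins(SAMPLE)
print(table["1E9E"])   # ('S', ['00DF'])
```

The F mapping for 1E9E is silently discarded, while 00DF keeps its F mapping only because it has no S entry to overwrite it.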
if one is ignoring case, and one maps equivalent code points that are
The length-changing strings are denoted in the file by F, and if there
is a folding that is better than nothing for the same character, there
will be a second entry for it marked as S. There is never an S without
an F, but there can be F entries without S ones.
There are currently 2 mappings marked as T, for Turkish only.
If the application can handle length changes, I claim it will get the
best results by using the C and F mappings, and, if in a Turkish locale, T.
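
For a feel of what the C and F mappings buy you, Python's built-in str.casefold() can serve as a stand-in. It applies full case folding, with no Turkish handling, and has nothing to do with Unicode::UCD; it is used here only as an illustration of a caseless comparison that tolerates a change in length.

```python
a = "STRASSE"
b = "stra\u00dfe"                       # "strasse" spelled with U+00DF

# Lower-casing is not enough: U+00DF has no single-code-point equivalent.
print(a.lower() == b.lower())           # False

# Full case folding maps U+00DF to "ss", so the two strings match.
print(a.casefold() == b.casefold())     # True

# The folded string is one code point longer than the original.
print(len(b), len(b.casefold()))        # 6 7
```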
But this isn't what the function returns. It does not present an
adequate interface to the data.
One way to fix it would be to return everything, including a hash of
hashes for codepoints that have more than one possible folding, like the
casespec() function.
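
A rough sketch of what "return everything" could look like, again in Python for brevity. The name all_foldings and the exact shape of the result are assumptions made for illustration, not the layout that casespec() uses; the sample lines are copied by hand from the Unicode file.

```python
from collections import defaultdict

SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def all_foldings(text):
    """Return {code: {status: [folded code points]}}, dropping nothing."""
    result = defaultdict(dict)
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        result[code][status] = mapping.split()
    return dict(result)

foldings = all_foldings(SAMPLE)
print(foldings["1E9E"])   # {'F': ['0073', '0073'], 'S': ['00DF']}
```

With this shape the caller sees every mapping for a code point and has to choose among them explicitly.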
Another option would be to add two extra boolean parameters, to denote
whether one wanted the Turkish mappings, and whether one was
willing to get a length-changing folded string returned.
This latter implementation would make it clearer to the application
writer what the implications of those choices are.
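
Here is one way the two-flag interface might behave, sketched in Python for brevity. Everything about it is an assumption made for illustration: the name fold, the parameter names turkish and full, and the rule that a T mapping takes precedence when Turkish is requested. The table is a handful of entries copied by hand from the Unicode file.

```python
# Hand-copied entries: code point -> {status: folded code points}.
FOLDS = {
    0x0041: {"C": [0x0061]},
    0x0049: {"C": [0x0069], "T": [0x0131]},
    0x00DF: {"F": [0x0073, 0x0073]},
    0x0130: {"F": [0x0069, 0x0307], "T": [0x0069]},
    0x1E9E: {"F": [0x0073, 0x0073], "S": [0x00DF]},
}

def fold(code, turkish=False, full=True):
    """Pick a folding for one code point (hypothetical interface)."""
    entry = FOLDS.get(code, {})
    if turkish and "T" in entry:
        return entry["T"]          # Turkish mapping requested and available
    if full and "F" in entry:
        return entry["F"]          # caller accepts a change in length
    if not full and "S" in entry:
        return entry["S"]          # single code point, better than nothing
    if "C" in entry:
        return entry["C"]          # common mapping, always acceptable
    return [code]                  # no usable mapping: folds to itself

print([hex(c) for c in fold(0x1E9E)])                # ['0x73', '0x73']
print([hex(c) for c in fold(0x1E9E, full=False)])    # ['0xdf']
print([hex(c) for c in fold(0x0049, turkish=True)])  # ['0x131']
```

Note that with full=False a code point such as 00DF, which has only an F mapping, folds to itself.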
Karl Williamson
Follow-Ups from: David Landgren <david@landgren.net>