Re: [perl #58430] Unicode::UCD::casefold() does not work as documented,nor prob as intended
I guess I should submit a patch for this, as I've ended up fixing it for
my own use. But I need to have a consensus about what the interface
should be, so I am requesting people to comment.
This function is an interface to a source file from the Unicode
Consortium. The programmer who wrote this function was apparently
unaware that some code points can have multiple foldings defined. In
such cases, the "full" folding is superior to the "simple" one, but
harder to implement because it can map one code point to several, which
is why the source file also gives an inferior single-code-point
alternative, when available. (The Perl documentation says that it uses
the full folding when ignoring case, but I no longer trust the
documentation, and haven't looked at the code to verify this.)
The foldings in the file are supposed to be locale independent, but
there are two code points where they were unable to live up to that
goal, both in Turkic languages, and so there are two special entries in
the file for these. These are marked in the file with a 'T', not the
'I' that the function's documentation states and its code expects.
(This means that the existing implementation can never return these
two entries.)
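
To make the 'T' versus 'I' mismatch concrete, here is a minimal sketch in Python (chosen only so the example is easy to run; it is not the Unicode::UCD code). It parses a four-line excerpt in the CaseFolding.txt layout and filters the entries by status. The names parse_entries, turkic_as_documented and turkic_as_in_file are hypothetical, introduced for illustration.

```python
# Excerpt in the CaseFolding.txt layout: code; status; mapping; # name
EXCERPT = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def parse_entries(text):
    """Yield (code, status, mapping) tuples, skipping blanks and comments."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        yield code, status, mapping


entries = list(parse_entries(EXCERPT))

# Filtering on 'I', as the documentation describes, matches nothing.
turkic_as_documented = [e for e in entries if e[1] == "I"]

# Filtering on 'T', as the file actually marks them, finds both entries.
turkic_as_in_file = [e for e in entries if e[1] == "T"]

print(len(turkic_as_documented), len(turkic_as_in_file))
```

Any code that looks for 'I' in this data finds no special entries at all, which matches the behaviour described above.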
What the function actually returns at present is a hash for the simplest
defined folding of each code point (except for one in a Turkic
language). That is, for code points with only a full folding
defined (again excepting that one Turkic case), it returns the full
folding, but for code points which also have a simple folding defined,
it returns the simple one.
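
The behaviour just described can be modelled roughly as follows. This is a Python sketch of the observed behaviour, not the actual implementation. The FOLDINGS table holds a few entries taken from CaseFolding.txt, and the function name simplest_folding and the 'code' key in the result are assumptions made for the example.

```python
# A few code points and the foldings the file defines for them,
# keyed by status letter. The Turkic 'T' entries are left out here.
FOLDINGS = {
    "0041": {"C": "0061"},                    # only a common folding
    "00DF": {"F": "0073 0073"},               # only a full folding
    "1E9E": {"F": "0073 0073", "S": "00DF"},  # both full and simple
}


def simplest_folding(code):
    """Return a record for the simplest folding defined for `code`.

    Preference order is C, then S, then F, so a code point with both
    a simple and a full folding reports only the simple one.
    """
    entry = FOLDINGS.get(code)
    if entry is None:
        return None
    for status in ("C", "S", "F"):
        if status in entry:
            return {"code": code, "status": status, "mapping": entry[status]}
    return None


for cp in ("0041", "00DF", "1E9E"):
    print(simplest_folding(cp))
```

For U+1E9E the caller sees only the 'S' mapping and has no way to learn that a full folding exists, which is the gap the proposals below try to close.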
My initial straw proposal is to retain the current interface so that
existing programs that use it won't have to change, even though it isn't
what they might be expecting, as the documentation is wrong. I would
then add a second interface which would return a hash, but with entries
for all the possible foldings for the code point parameter to the function.
Another possibility that I like more and more is to merely add all the
possible foldings to the current returned hash. Then, no new interface
would have to be defined, current code using the function could be
unchanged. The current key is called 'mapping'. That would remain, but
the documentation would be updated to define it properly as the simplest
folding available for the parameter code point (and it would remain
mis-named). Additional keys would be the ones in the file: 'C', 'S',
'F', and 'T', and their values would be their corresponding foldings.
('C' stands for common, and is the most frequent status. It means the
'full' and 'simple' foldings are the same.) The other existing keys
would remain. 'status' would retain its current meaning of saying which
of [CSFT] 'mapping' is from, but the documentation would change to
emphasize that there may be other possible mappings for this code point.
(But also note that the current documentation says 'status' can be 'I'
(which can never be returned) instead of 'T'.)
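
As a rough illustration of this second proposal, the returned hash could look like the dictionary built below. This is a Python sketch under stated assumptions, not a patch: the function name casefold_record is hypothetical, the 'code' key is assumed to be among the existing keys, and the C, S, F, T preference order used to choose 'mapping' is my reading of "simplest folding available".

```python
# (code, status, mapping) triples as they appear in CaseFolding.txt.
ENTRIES = [
    ("0049", "C", "0069"),
    ("0049", "T", "0131"),
    ("0130", "F", "0069 0307"),
    ("0130", "T", "0069"),
    ("1E9E", "F", "0073 0073"),
    ("1E9E", "S", "00DF"),
]


def casefold_record(code, entries=ENTRIES):
    """Build the proposed hash: the old keys plus one key per status."""
    found = {status: mapping for c, status, mapping in entries if c == code}
    if not found:
        return None
    # Assumed preference for the legacy 'mapping' key: C, S, F, then T.
    status = next(s for s in ("C", "S", "F", "T") if s in found)
    record = {"code": code, "status": status, "mapping": found[status]}
    record.update(found)  # add the 'C', 'S', 'F', 'T' keys that exist
    return record


print(casefold_record("1E9E"))
print(casefold_record("0130"))
```

Existing callers that read only 'mapping' and 'status' would see no change, while new callers could ask for 'F' or 'T' directly and test for the key's presence.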
There are lots of other possibilities as well. Please respond with your
comments.
Karl Williamson
Follow-Ups from: "Rafael Garcia-Suarez" <rgarciasuarez@gmail.com>