
Re: [perl #58430] Unicode::UCD::casefold() does not work as documented,nor prob as intended



I guess I should submit a patch for this, as I've ended up fixing it for 
my own use.  But I need to have a consensus about what the interface 
should be, so I am requesting people to comment.

This function is an interface to a source file from the Unicode 
Consortium.  The programmer who wrote this function was apparently 
unaware that some code points can have multiple foldings defined.  In 
such cases, the "full" folding is superior to the "simple" one but 
harder to implement, which is why the source file also gives the 
inferior "simple" alternative when one is available.  (The Perl 
documentation says that Perl uses the full folding when ignoring case, 
but I no longer trust the documentation, and haven't looked at the code 
to verify this.)

The foldings in the file are supposed to be locale independent, but 
there are two code points where they could not live up to that goal, 
both in Turkic languages, and so the file has two special entries for 
these.  These are marked in the file with a 'T', not the 'I' that the 
function's documentation states and its code expects.  (This means that 
the existing implementation will never return these two entries.)
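
To make the "multiple foldings per code point" point concrete, here is a 
minimal sketch that reads a few lines in the layout of the Consortium's 
CaseFolding.txt (code; status; mapping; # name) and keeps every entry 
rather than just one.  The helper name parse_foldings() is hypothetical 
and not part of Unicode::UCD, and the sample lines are only a small 
excerpt chosen for illustration.

```perl
use strict;
use warnings;

# Illustrative excerpt in the CaseFolding.txt layout:
#   <code>; <status>; <mapping>; # <name>
my @lines = (
    '0041; C; 0061; # LATIN CAPITAL LETTER A',
    '0049; C; 0069; # LATIN CAPITAL LETTER I',
    '0049; T; 0131; # LATIN CAPITAL LETTER I',
    '00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S',
    '1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S',
    '1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S',
);

# parse_foldings() is a hypothetical helper, not part of Unicode::UCD.
# It returns { code => { status => mapping, ... }, ... }, so a code point
# with several entries keeps all of them.
sub parse_foldings {
    my @input = @_;
    my %foldings;
    for my $line (@input) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($code, $status, $mapping) =
            $line =~ /^([0-9A-F]+);\s*([A-Z]);\s*([0-9A-F ]+?);/
            or next;                       # skip lines in any other layout
        $foldings{$code}{$status} = $mapping;
    }
    return \%foldings;
}

my $foldings = parse_foldings(@lines);
for my $code (sort keys %$foldings) {
    my $entry = $foldings->{$code};
    print "$code: ",
        join(', ', map { "$_ => $entry->{$_}" } sort keys %$entry), "\n";
}
```

In this excerpt U+1E9E has both an 'F' and an 'S' entry, and U+0049 has 
both a 'C' and a 'T' entry, which are the cases a single-valued 
interface cannot represent.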

What the function actually returns at present is a hash for the simplest 
defined folding of each code point (except for one in a Turkic 
language).  That is, for code points with only a full folding defined 
(again excepting the Turkic one), it returns the full folding, but for 
code points that also have a simple folding defined, it returns the 
simple one.
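
The behaviour just described can be modelled roughly as below.  This is 
a sketch of the observable result only, not the actual Unicode::UCD 
code: simplest_folding() is a hypothetical name, the data is a small 
illustrative sample, and the C-then-S-then-F preference order is my 
assumption about how to reproduce "simplest available".

```perl
use strict;
use warnings;

# Every folding defined per code point, keyed by status (illustrative
# sample in the style of the Unicode source file).
my %defined = (
    '0041' => { C => '0061' },
    '00DF' => { F => '0073 0073' },
    '1E9E' => { F => '0073 0073', S => '00DF' },
    '0130' => { F => '0069 0307', T => '0069' },
);

# Hypothetical model: return the simplest non-Turkic folding.  'T' is
# deliberately absent from the list, mirroring the fact that the existing
# function never returns the Turkic entries.
sub simplest_folding {
    my ($code) = @_;
    my $entry = $defined{$code} or return;
    for my $status (qw(C S F)) {
        next unless exists $entry->{$status};
        return {
            code    => $code,
            status  => $status,
            mapping => $entry->{$status},
        };
    }
    return;    # only a 'T' entry was defined
}

for my $code (sort keys %defined) {
    my $fold = simplest_folding($code);
    print "$code: status=$fold->{status} mapping=$fold->{mapping}\n";
}
```

With this sample, U+1E9E yields its simple folding even though a full 
one exists, while U+00DF and U+0130 yield their full foldings because 
nothing simpler (other than 'T') is defined.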

My initial straw proposal is to retain the current interface, so that 
existing programs that use it won't have to change, even though it 
doesn't do what they might expect, since the documentation is wrong.  I 
would then add a second interface that returns a hash with entries for 
all the possible foldings of the code point passed to the function.

Another possibility that I like more and more is to merely add all the 
possible foldings to the hash currently returned.  Then no new interface 
would have to be defined, and current code using the function could 
remain unchanged.  The current key is called 'mapping'.  That would 
remain, but the documentation would be updated to define it properly as 
the simplest folding available for the parameter code point (and it 
would remain mis-named).  Additional keys would be the ones in the file: 
'C', 'S', 'F', and 'T', and their values would be their corresponding 
foldings.  ('C' means common, and is the status that occurs most often.  
It means the 'full' and 'simple' foldings are the same.)  The other 
existing keys would remain.  'status' would retain its current meaning 
of saying which of [CSFT] 'mapping' is from, but the documentation would 
change to emphasize that there may be other possible mappings for this 
code point.  (But also note that the current documentation says 'status' 
can be 'I', which can never be returned, instead of 'T'.)
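
As a rough picture of what this second option might return, the sketch 
below builds such a hash from a small illustrative sample.  The name 
proposed_casefold() is hypothetical, and the choice never to pick the 
'T' entry for 'mapping' (so that existing callers keep seeing the same 
value) is my assumption; the proposal itself leaves that open.

```perl
use strict;
use warnings;

# Every folding defined per code point, keyed by status (illustrative
# sample in the style of the Unicode source file).
my %defined = (
    '1E9E' => { F => '0073 0073', S => '00DF' },
    '0049' => { C => '0069',      T => '0131' },
);

# Hypothetical sketch of the extended hash: the existing 'code', 'status'
# and 'mapping' keys keep their meaning, and one extra key is added per
# status letter that has an entry for this code point.
sub proposed_casefold {
    my ($code) = @_;
    my $entry = $defined{$code} or return;

    # Assumption: 'mapping' is the simplest of C, S, F; never 'T'.
    my ($status) = grep { exists $entry->{$_} } qw(C S F);
    return unless defined $status;

    return {
        code    => $code,
        status  => $status,
        mapping => $entry->{$status},
        %$entry,    # adds whichever of 'C', 'S', 'F', 'T' are defined
    };
}

my $fold = proposed_casefold('1E9E');
printf "mapping=%s status=%s full=%s\n", @{$fold}{qw(mapping status F)};
# prints "mapping=00DF status=S full=0073 0073"
```

Code that only reads 'mapping' and 'status' would see the same values 
as in the model of the current behaviour, while new code could look for 
'F' or 'T' explicitly and fall back when the key does not exist.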

There are lots of other possibilities as well.  Please respond with your 
comments.

Karl Williamson


Follow-Ups from:
"Rafael Garcia-Suarez" <rgarciasuarez@gmail.com>
