capture the wtf8 flag
For the nth time this year, I am banging my head against email, 7-bit
transport, 8bit content-transfer-encoding, and our hateful savior Unicode.
I asked something like this on #p5p:
"If the utf8 flag is set, it's known to be character data, right?"
I got a lot of "OH GOD NOT TRUE!" responses, but now I can't find much evidence
that this isn't true. I was told "You just need to keep track!" Ugh. I
believe I found a suggestion to use Hungarian notation for this somewhere in
the core docs.
Anyway, being able to mostly trust this would be great. Maybe some of the more
funny-character-friendly members of the audience can lend some wisdom. My
understanding is something like this:
1. If the utf8 flag is on, it's character data and you can safely use
Encode::encode to produce UTF-8 encoded octets.
2. If the utf8 flag is off, all bets are off. Good luck, sucker.
3. Oops: Some moron might set the utf8 bit on incorrectly, but that's a bug
and not a magic random property of the flag itself.
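To make point 1 concrete, here is a minimal sketch. It assumes a string literal containing a code point above 0xFF, which Perl has to store in its internal UTF-8 form, so the flag is on. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: the literal holds a code point above 0xFF (U+263A), so Perl
# stores it in its internal UTF-8 form and the flag is on.
my $chars = "caf\x{e9} \x{263A}";
print "utf8 flag is on\n" if utf8::is_utf8($chars);

# Encoding character data yields a byte string; the flag is off on the result.
my $octets = Encode::encode('UTF-8', $chars);
print "utf8 flag is off after encode\n" unless utf8::is_utf8($octets);

# 6 characters become 9 octets: e-acute takes 2 bytes, U+263A takes 3.
printf "%d characters, %d octets\n", length($chars), length($octets);
```

This only demonstrates the flag-on case; it says nothing about what a flag-off string contains.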
Then:
1. The utf8 flag will get turned on when decoding byte strings.
2. The utf8 flag should get set correctly by XS code that produces
UTF8 strings.
3. Oops: if your decoded string was all valid ASCII, you won't get utf8
turned on. If you want to be able to rely on the utf8 flag, set it
yourself, immediately after successfully decoding.
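The "set it yourself" step in point 3 could be sketched as below. Reading it as a call to utf8::upgrade is my assumption: that function switches the internal representation (and so turns the flag on) without changing the string's value.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "plain ascii";    # nothing but 7-bit bytes

# LEAVE_SRC keeps decode from consuming $octets when a CHECK value is passed.
my $string = Encode::decode(
    'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC(),
);

# Don't rely on what decode did to the flag for ASCII-only input; force it
# on. utf8::upgrade leaves the logical contents of the string untouched.
utf8::upgrade($string);

print "utf8 flag is on\n"  if utf8::is_utf8($string);
print "value unchanged\n"  if $string eq 'plain ascii';
```

Using FB_CROAK means a malformed input dies at decode time, so the flag only ever gets set on data that decoded cleanly.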
Implications:
1. If you get handed a string with the utf8 flag on, you know that either (a)
it is safe to Encode::encode into utf-8 or (b) some moron set the flag
by mistake. It is never valid for utf8 to be set if (a) is false.
2. If you get handed a string with the utf8 flag off, you're hosed. You
should've set that as soon as possible upon reading the string into
memory, as by now you've probably forgotten what the source of the data
was, let alone its encoding. (No, 7-bit isn't safe either, because maybe
it was unicode-1-1-utf-7.)
Is this true? If so, I can probably adopt the following behavior:
1. Given a string with utf8 on, assume a charset of utf-8 and encode to that.
2. Otherwise, assume it's a byte string and hope for the best.
That's an improvement since right now everything basically does (2).
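A rough sketch of that behavior, assuming the understanding above holds. The helper name octets_for_mail and its return shape (octets plus a charset label, or undef when the charset is unknown) are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns (octets, charset), where charset is undef
# when we have no idea what the bytes are.
sub octets_for_mail {
    my ($str) = @_;

    # (1) utf8 on: treat as character data and encode to UTF-8.
    return (Encode::encode('UTF-8', $str), 'utf-8') if utf8::is_utf8($str);

    # (2) utf8 off: pass it through as a byte string and hope for the best.
    return ($str, undef);
}

my ($body, $charset) = octets_for_mail("\x{263A}");
printf "%d octets, charset %s\n", length($body), $charset // 'unknown';
```

The undef charset in case (2) is deliberate: it keeps the guess visible to the caller instead of labeling unknown bytes as something they may not be.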
Thanks. I look forward to gaining a better understanding of the wtf flag.
--
rjbs
Follow-ups from:
  Aristotle Pagaltzis <pagaltzis@gmx.de>
  Glenn Linderman <perl@NevCal.com>