To avoid having codec conversions throw exceptions, you can pass 'ignore' as the second parameter to encode() and decode(). Know that this garbles the data, because anything that cannot be converted is silently dropped, so you should not do this just 'to make errors go away.' In particular, in multi-byte, variable-width encodings like UTF8 you may see many bytes being consumed even if they do not lead to a valid character. Trying to decode data that isn't UTF8 as UTF8 should be avoided.

There are two flavours of unicode representation, chosen when building Python, since 2.2 (see also PEP261). Build options call these UCS2 and UCS4, or narrow and wide.

Note that the standard library has good-enough handling of UTF16 surrogates that you might as well think of a UCS2 build as a UTF16 implementation, particularly if you are mostly just passing strings around, because encode() and decode() are pretty clever about UTF. Still, when you write unicode manipulation functions you will want to read up a little more.

The narrow/wide distinction was still there in early py3 and went away in Python 3.3 (PEP 393), after which every build behaves like a wide one. On windows, it looks like py2 builds were often narrow (probably relating to UTF16 windows interfaces).

For example, \U escapes for codepoints above U+FFFF will generate 2-codepoint strings (surrogate pairs) on narrow builds, and 1-codepoint strings on wide builds.

Also relevant is that the UTF8 codec is aware of surrogates, on both narrow and wide builds. Python3 is a little more strict in a few places; for example, it won't let surrogates into UTF8 accidentally.
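The points above about 'ignore', codepoints above U+FFFF, and py3's strictness about surrogates can be tried out with a short sketch. It assumes Python 3.3 or later (so there is no narrow/wide difference left to observe), and the variable names and sample bytes are made up for illustration only.

```python
# A minimal sketch, assuming Python 3.3+. All names here are illustrative.

# The word "cafe" with an accented e, encoded as Latin-1. Not valid UTF8.
latin1_bytes = b"caf\xe9"

# Strict decoding (the default) raises instead of guessing.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' silently drops the byte it cannot decode: data is lost.
ignored = latin1_bytes.decode("utf-8", "ignore")

# 'replace' at least leaves a visible U+FFFD marker where data was lost.
replaced = latin1_bytes.decode("utf-8", "replace")

# A codepoint above U+FFFF written with a \U escape.
astral = "\U0001F600"
astral_len = len(astral)                   # 1 here; a narrow build would give 2
astral_utf8 = astral.encode("utf-8")       # one 4-byte sequence
astral_utf16 = astral.encode("utf-16-be")  # a surrogate pair, 2 x 2 bytes

# A lone surrogate is not a real character; py3 refuses to put it in UTF8.
lone_surrogate = "\ud83d"
try:
    lone_surrogate.encode("utf-8")
    lone_ok = True
except UnicodeEncodeError:
    lone_ok = False

print(strict_ok, repr(ignored), repr(replaced))
print(astral_len, astral_utf8, astral_utf16, lone_ok)
```

If you have to tolerate bad input, 'replace' is usually the safer choice over 'ignore', because the damage stays visible in the output instead of vanishing.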