python - Is the Unicode code point value equal to the UTF-16BE representation for every character?
I saved some strings in Microsoft Agenda in Unicode big-endian format (UTF-16BE). When I open the file with the shell command xxd
to see the binary values, write them down, and then get the value of each Unicode code point with ord(),
going character by character (this Python built-in function takes a one-character Unicode string and returns its code point value), and compare them, I find that they are equal.
But I think the Unicode code point value is a different thing from UTF-16BE: one is a code point, the other is an encoding format. Some of them are equal, but maybe they differ for other characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and its UTF-16 encoding map one-to-one.
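As a quick check of the one-to-one claim, here is a minimal sketch, assuming Python 3. The helper name utf16be_as_int is made up for illustration; it reads the UTF-16BE bytes of a single character as one big-endian integer so the result can be compared with ord().

```python
def utf16be_as_int(ch):
    """Return the UTF-16BE bytes of a one-character string as a single integer."""
    return int.from_bytes(ch.encode("utf-16-be"), "big")


# A few characters from the Basic Multilingual Plane: the code point and
# the 16-bit UTF-16BE word are the same number.
for ch in ("A", "\u00e9", "\u4e2d", "\uffff"):
    same = ord(ch) == utf16be_as_int(ch)
    print(f"U+{ord(ch):04X}", ch.encode("utf-16-be").hex(), same)

# A character outside the BMP: four bytes, and no longer equal to ord().
emoji = "\U0001F600"
print(f"U+{ord(emoji):04X}", emoji.encode("utf-16-be").hex(),
      ord(emoji) == utf16be_as_int(emoji))
```

The last line is the case the question is really about: the comparison only holds while every character stays inside the BMP.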
For codepoints in the range U+10000 to U+10FFFF, two words in the range 0xD800 to 0xDFFF are used: a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
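The surrogate arithmetic can be sketched as follows, assuming Python 3. The function name to_surrogates is hypothetical: it subtracts 0x10000 from the codepoint, puts the high 10 bits of the result in the lead surrogate and the low 10 bits in the trail surrogate.

```python
def to_surrogates(cp):
    """Split a codepoint in U+10000..U+10FFFF into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint is not outside the BMP")
    offset = cp - 0x10000                 # 20-bit value
    lead = 0xD800 + (offset >> 10)        # high 10 bits -> 0xD800..0xDBFF
    trail = 0xDC00 + (offset & 0x3FF)     # low 10 bits  -> 0xDC00..0xDFFF
    return lead, trail


lead, trail = to_surrogates(0x1F600)
print(hex(lead), hex(trail))

# Cross-check against Python's own UTF-16BE encoder.
expected = "\U0001F600".encode("utf-16-be")
print(expected == lead.to_bytes(2, "big") + trail.to_bytes(2, "big"))
```

For U+1F600 this gives the words 0xD83D and 0xDE00, which is what xxd would show for that character in a UTF-16BE file.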
See the UTF-16 Wikipedia article for the nitty-gritty details.
So, most UTF-16 big-endian bytes, when printed, can be mapped directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words starting with a 0xD8 through 0xDF byte, you'll have to map the surrogates to the actual codepoint.
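Going the other way, from the bytes shown by xxd back to codepoints, could look like the sketch below. It assumes Python 3, input without a byte order mark, and a made-up function name utf16_code_points; unpaired surrogates are passed through unchanged rather than rejected, which is a simplification.

```python
def utf16_code_points(data, little_endian=False):
    """Decode raw UTF-16 bytes (no BOM) into a list of codepoint integers."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    order = "little" if little_endian else "big"
    # Read the data as 16-bit words; for little-endian the bytes are swapped.
    words = [int.from_bytes(data[i:i + 2], order) for i in range(0, len(data), 2)]

    result = []
    i = 0
    while i < len(words):
        w = words[i]
        is_lead = 0xD800 <= w <= 0xDBFF
        has_trail = i + 1 < len(words) and 0xDC00 <= words[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Combine the surrogate pair into one codepoint above U+FFFF.
            result.append(0x10000 + ((w - 0xD800) << 10) + (words[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP word: the word value is the codepoint itself.
            result.append(w)
            i += 1
    return result


text = "A\u4e2d\U0001F600"
print([hex(cp) for cp in utf16_code_points(text.encode("utf-16-be"))])
print(utf16_code_points(text.encode("utf-16-le"), little_endian=True)
      == [ord(c) for c in text])
```

In practice bytes.decode("utf-16-be") does all of this for you; the manual version is only there to show where the surrogate mapping happens.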