python - Is the Unicode code point value equal to the UTF-16BE representation for every character? -


I saved some strings in Microsoft Agenda in Unicode big-endian format (UTF-16BE). When I open the file with the shell command xxd to see the binary values and write them down, then get the Unicode code point of each character with ord() (this Python built-in function takes a one-character Unicode string and returns its code point value) and compare the two, I find that they are equal.

But I think the Unicode code point value is something different from UTF-16BE: one is a code point, the other is an encoding format. Some of them are equal, but maybe they differ for other characters.

Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?

No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).

For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and the UTF-16 encoding map one-to-one.

For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used: a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
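As a rough illustration of the two cases above, here is a minimal Python 3 sketch that turns a code point into its UTF-16 words and checks the result against Python's own "utf-16-be" codec. The helper names utf16_words and words_to_bytes_be are made up for this example; they are not part of any library.

```python
def utf16_words(codepoint):
    """Return the list of 16-bit UTF-16 words for one code point (hypothetical helper)."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded on their own")
    if codepoint <= 0xFFFF:
        # Basic Multilingual Plane: the single word equals the code point.
        return [codepoint]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = codepoint - 0x10000
    lead = 0xD800 + (offset >> 10)     # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)  # low 10 bits
    return [lead, trail]


def words_to_bytes_be(words):
    """Serialize 16-bit words as big-endian bytes (hypothetical helper)."""
    return b"".join(word.to_bytes(2, "big") for word in words)


for ch in ("A", "\u20ac", "\U0001F600"):
    words = utf16_words(ord(ch))
    # The hand-rolled encoding should agree with the built-in codec.
    assert words_to_bytes_be(words) == ch.encode("utf-16-be")
    print("U+%04X -> %s" % (ord(ch), " ".join("%04X" % w for w in words)))
```

For "A" and the euro sign the single word is the code point itself, which is what the comparison in the question observed; only the last character, U+1F600, comes out as a surrogate pair.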

See the UTF-16 Wikipedia article for the nitty-gritty details.

So, most UTF-16 big-endian bytes, when printed, can be mapped directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words starting with a 0xD8 through 0xDF byte, you'll have to map the surrogates to the actual codepoint.
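Going the other way, from raw bytes (such as those shown by xxd) back to code points, could look roughly like this in Python 3. This is only a sketch: codepoints_from_utf16 is a hypothetical helper name, and it assumes the data carries no byte order mark.

```python
def codepoints_from_utf16(data, byteorder="big"):
    """Map UTF-16 bytes to a list of code points (hypothetical helper, no BOM handling)."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Read the data two bytes at a time; byteorder="little" swaps the bytes.
    words = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data), 2)]
    codepoints = []
    i = 0
    while i < len(words):
        word = words[i]
        if 0xD800 <= word <= 0xDBFF:
            # Lead surrogate: it must be followed by a trail surrogate.
            if i + 1 >= len(words) or not 0xDC00 <= words[i + 1] <= 0xDFFF:
                raise ValueError("lead surrogate without a trail surrogate")
            trail = words[i + 1]
            codepoints.append(0x10000 + ((word - 0xD800) << 10) + (trail - 0xDC00))
            i += 2
        elif 0xDC00 <= word <= 0xDFFF:
            raise ValueError("trail surrogate without a lead surrogate")
        else:
            # Any other word is the code point itself.
            codepoints.append(word)
            i += 1
    return codepoints


text = "A\u20ac\U0001F600"
expected = [ord(ch) for ch in text]
assert codepoints_from_utf16(text.encode("utf-16-be")) == expected
assert codepoints_from_utf16(text.encode("utf-16-le"), "little") == expected
```

In practice data.decode("utf-16-be") does all of this for you; the manual loop is only meant to make the surrogate mapping visible.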

