Some Unicode Notes

Update: I added another useful post about Unicode thanks to this tweet.

Every now and again I find myself reading up on character sets, usually when I'm doing some kind of heavy text processing.

When I first started out programming I used to tell myself, "Just use ASCII, it's not like I write programs for people who speak other languages, right?". Well, as the Web has become more multilingual, it's increasingly hard to justify that thinking.

Even if you are not writing programs for non-English speakers, there may be other reasons to pay attention to character encoding. UTF-8 is apparently mandatory on the Web (the HTML standard tells authors to use it), so a basic understanding could be seen as foundational.

UTF-8 has some terms and notes I need to keep in mind:

  • UTF-8 uses 1, 2, 3, or 4 bytes to encode a character, that is 8, 16, 24, or 32 bits.
  • Because of this it's called a variable-width character encoding.
  • A UTF-8 file that contains only characters in the ASCII range is identical to an ASCII file.
  • The term Basic Multilingual Plane (BMP), or Plane 0, refers to the range that holds the most commonly used characters of most of the world's scripts.
  • The BMP consists of the code points 0000 to FFFF (hexadecimal).
  • A breakdown of the groups can be found here.
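
To make the byte counts above concrete, here is a minimal sketch in JavaScript. It assumes Node.js (or a modern browser) where `TextEncoder` is available as a global; the helper name `utf8Bytes` is my own, purely for illustration.

```javascript
// Encode a string as UTF-8 and return the bytes as a plain array.
// utf8Bytes is a hypothetical helper name, not a built-in.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// One character from each width class: 1, 2, 3 and 4 bytes.
const samples = ['A', '\u00E9' /* é */, '€', '😀'];
const widths = samples.map((ch) => utf8Bytes(ch).length);
console.log(widths); // [ 1, 2, 3, 4 ]

// Pure ASCII text: the UTF-8 bytes are exactly the ASCII codes.
console.log(utf8Bytes('Hi!')); // [ 72, 105, 33 ]
```

The last line is the ASCII compatibility point in action: nothing changes at the byte level for text that stays within the ASCII range.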

Aside from UTF-8 there are also UTF-16 and UTF-32.

Neither is backwards compatible with ASCII. That is, if you convert an ASCII file to UTF-16 or UTF-32, its contents at the byte level are going to change (but not the actual data).

UTF-16 is variable width like UTF-8, but it uses either one or two 16-bit values (code units) for each code point. ASCII, of course, only uses one byte (8 bits) for each of its characters, hence the backwards incompatibility.

I should mention that a code point is the numeric value assigned to a character in a character set. Think of a character set as a huge array where the code point is an index.
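
The "array index" idea can be poked at directly in JavaScript. This is just a small sketch using the built-in `codePointAt` and `String.fromCodePoint`; the variable names are mine.

```javascript
// Look up the code point (the "index") of a character...
const codeOfA = 'A'.codePointAt(0);
const codeOfEuro = '€'.codePointAt(0);
console.log(codeOfA); // 65
console.log(codeOfEuro.toString(16)); // 20ac

// ...and go the other way, from code point back to character.
const euroAgain = String.fromCodePoint(0x20ac);
console.log(euroAgain === '€'); // true
```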

UTF-32 is fixed width: each character is encoded using 32 bits (a whole 4 bytes!). Converting an ASCII file to UTF-16 or UTF-32 is going to result in a larger file. That goes for databases as well.
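
Here is a rough sketch of that size difference, assuming Node.js (it relies on `Buffer`). As far as I know Node has no built-in UTF-32 encoder, so the UTF-32 figure is computed by hand as 4 bytes per code point, and none of the numbers include a byte order mark.

```javascript
const text = 'hello'; // 5 ASCII characters

const utf8Size = Buffer.byteLength(text, 'utf8');     // 5
const utf16Size = Buffer.byteLength(text, 'utf16le'); // 10

// Assumption: UTF-32 size worked out manually, 4 bytes per code point.
// Spreading a string iterates by code point, not by 16-bit unit.
const utf32Size = [...text].length * 4;               // 20

console.log({ utf8Size, utf16Size, utf32Size });
```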

I think what trips me up are statements like "JavaScript/ECMAScript supports UTF-8" that I saw online when I was first learning the language. They had me assuming that the escape sequences \u0000 - \uFFFF are how you use UTF-8 code points in strings.

I even wrote that here in an earlier draft.

That's not really accurate. All strings in JavaScript are UTF-16 encoded, or at least each element of a string is a 16-bit value (a code unit); the spec does not get into the details of implementation. When a value in the range 0xD800 - 0xDBFF is immediately followed by a value in the range 0xDC00 - 0xDFFF, the two are called a surrogate pair, and the spec describes an algorithm for calculating the resulting code point.
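
The surrogate pair calculation is short enough to write out. This is a sketch of the standard UTF-16 arithmetic; `decodeSurrogatePair` is a hypothetical helper name of mine, and in real code `codePointAt` already does this for you.

```javascript
// Combine a high and a low surrogate into one code point.
// decodeSurrogatePair is a hypothetical helper name, not a built-in.
function decodeSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const smiley = '😀'; // U+1F600, outside the BMP
console.log(smiley.length); // 2, because it takes two 16-bit values

const high = smiley.charCodeAt(0); // 0xD83D
const low = smiley.charCodeAt(1);  // 0xDE00
const decoded = decodeSurrogatePair(high, low);

console.log(decoded.toString(16)); // 1f600
console.log(decoded === smiley.codePointAt(0)); // true
```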

The `\uXXXX` syntax is actually for representing characters of the BMP that may not be on your keyboard. At runtime, each one is still a 16-bit value in memory.
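
A quick sketch of what that looks like in practice. The `\u{...}` form at the end is not something I covered above; it is the code point escape added in ES2015, included here only for comparison.

```javascript
// A BMP character written as a single \uXXXX escape.
const euro = '\u20AC';
console.log(euro === '€', euro.length); // true 1

// A character outside the BMP needs two \uXXXX escapes (a surrogate pair)...
const pair = '\uD83D\uDE00';

// ...or the ES2015 \u{...} escape, which takes the code point directly.
const braces = '\u{1F600}';

console.log(pair === braces, pair.length); // true 2
```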

Where does UTF-8 come in? It's your actual source code! The JS files that you serve or pass to Node are expected to be UTF-8 encoded! This is consistent with the Web's requirement that everything be UTF-8 encoded.

I tried passing a UTF-32 encoded file to node.

There may be opportunities here for subtle bugs and even security issues but I have not taken the time to properly understand them yet. I do know that mixing up the encoding of your data in a MariaDB/MySQL server can cause your queries to give the wrong results.
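
This is not how MariaDB/MySQL works internally, but the general failure mode is easy to reproduce in Node.js: write text as UTF-8, read the same bytes back as Latin-1, and comparisons quietly stop matching. A minimal sketch, assuming Node's `Buffer`:

```javascript
const original = 'caf\u00E9'; // "café"

// Store the text as UTF-8: é becomes the two bytes 0xC3 0xA9.
const bytes = Buffer.from(original, 'utf8');

// Read the same bytes back with the wrong encoding.
const misread = bytes.toString('latin1');
console.log(misread); // cafÃ©

// A lookup or comparison against the original now fails.
console.log(misread === original); // false

// Decoding with the encoding that was used to write gives the text back.
console.log(bytes.toString('utf8') === original); // true
```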

I have also heard of XSS and SQL injection bypasses by manipulating misconfigured character encoding. I'm starting to get a better understanding as to how, but again I have not dug into it yet.

