The omnipresence and mystery of Unicode
Let’s start with some trivia
Do you know what’s printed in the console after running the following code?
If you don’t know the answer, or if you tried it on your end, you might be surprised by the output you get. There is a fundamental reason behind it, and if you’re thinking it’s a JavaScript gotcha... no, you can replicate it in other programming languages.
In this post, we are going to unveil why the output of the previous snippet is
History
We all know that computers only understand binary, so every character is assigned a number. Before Unicode, there were many different systems for assigning those numbers, called character encodings.
Have you struggled with typing the ‘@’?
At some point, we all used the shortcut ‘alt + 64’, where ‘64’ is the number that represents the ‘@’.
The widely used character encoding called ASCII used 7 bits to represent 128 characters: English letters, digits, punctuation, and control characters. Later on, 8-bit Extended ASCII variants raised the limit to 256 characters, which made room for things like the accented letters used in Spanish.
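As a quick check of the ‘@’ example, here is a minimal sketch in JavaScript. The variable names are only illustrative.

```javascript
// Go from the character '@' to its number, and from the number back to the character.
const atSign = '@';
const atCode = atSign.charCodeAt(0);         // numeric value of the first code unit
const backToChar = String.fromCharCode(64);  // character assigned to the number 64

console.log(atCode);      // 64
console.log(backToChar);  // '@'

// 7-bit ASCII covers the values 0 through 127, so '@' is part of it.
const isAscii = atCode < 128;
console.log(isAscii);     // true
```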
But ASCII lacked support for writing systems with thousands of characters, like Chinese or Japanese. This led to Unicode, the universal character encoding used nowadays to represent things like the emojis I presented at the beginning.
Unicode currently defines around 150k characters, and its code space has room for 1,114,112 code points.
To recap, Unicode was created to represent and standardize every character and its corresponding number.
Now, before we dive into answering our initial question, let’s review some terminology.
Terminology
Code space: the range of numerical values from 0 through 0x10FFFF. Each value in that range is called a code point.
Unicode plane: a contiguous range of 65,536 (0x10000) code points. The entire code space is split into 17 planes.
Basic Multilingual Plane (BMP): plane 0, which contains the most commonly used characters, like digits and the Latin alphabet.
U+0041 represents the letter ‘A’
Astral planes: planes 1-16 are named astral planes or supplementary planes. They contain historic scripts and symbols, like Egyptian hieroglyphs and emojis.
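Since every plane holds 0x10000 code points, the plane of a code point can be found with a single division. The following is a small sketch of that idea; `planeOf` is a hypothetical helper name, not a built-in.

```javascript
// Sketch: find which of the 17 planes a code point belongs to.
// planeOf is a hypothetical helper, not part of JavaScript.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code point');
  }
  // Each plane holds 0x10000 (65,536) code points.
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x0041));    // 0  (the BMP, letter 'A')
console.log(planeOf(0x1F970));   // 1  (an astral plane, the 🥰 emoji)
console.log(planeOf(0x10FFFF));  // 16 (the last plane)
```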
Character: a unit of information used to organize and represent textual data.
Every character has a name. For example, Latin Capital Letter A is the name of the character used to represent the letter ‘A’.
Code point: a number that represents a Unicode character.
Its format is ‘U+<hex>’, written with at least four hexadecimal digits.
U+0041 represents the letter ‘A’
A code point is encoded as code units.
Code units: a bit sequence used to encode each character within a given encoding form, that is, to represent the abstract characters in physical bits.
For instance, the letter ‘A’ has the code point U+0041, and in the UTF-16 encoding it is a single code unit: 0x0041.
Character Encoding: transforms abstract code points into physical bits: code units. In other words, the character encoding translates the Unicode code points to unique code unit sequences.
Popular encodings are UTF-8, UTF-16, and UTF-32.
UTF-8 uses 8-bit code units.
UTF-16 uses 16-bit code units.
UTF-32 uses 32-bit code units.
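To make the difference between the three encoding forms more concrete, here is a rough sketch that counts how many code units one character needs in each of them. `codeUnitCounts` is a hypothetical helper name, and the sketch assumes an environment where `TextEncoder` is available as a global (modern browsers and Node.js).

```javascript
// Sketch: count the code units each encoding form needs for a string.
// codeUnitCounts is a hypothetical helper, not a built-in.
function codeUnitCounts(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // bytes, i.e. 8-bit code units
    utf16: text.length,                          // JS strings count 16-bit code units
    utf32: [...text].length,                     // one 32-bit code unit per code point
  };
}

console.log(codeUnitCounts('A'));          // { utf8: 1, utf16: 1, utf32: 1 }
console.log(codeUnitCounts('\u{1F970}'));  // { utf8: 4, utf16: 2, utf32: 1 }
```

The same code point takes a different number of code units depending on the encoding, but it is still one code point.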
Code point, unit & character Encoding Recap
Grapheme or symbol: a distinctive unit of writing in a given writing system. In other words, what a reader sees as a single drawn character.
Combining mark: a character that applies to the preceding base character to create a new grapheme.
For instance, in the next snippet, the first console.log represents the character ‘o’ and the ‘´’, but in the last console.log when combined they represent the letter ‘ó’
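A minimal sketch of a combining mark in action could look like this. It assumes the mark is U+0301 (Combining Acute Accent); the variable names are illustrative.

```javascript
// Sketch: a base character followed by a combining mark.
const baseLetter = 'o';
const acuteMark = '\u0301';                     // COMBINING ACUTE ACCENT (assumed mark)
const combinedLetter = baseLetter + acuteMark;  // drawn as a single 'ó'

console.log(baseLetter, acuteMark);  // the two characters printed separately
console.log(combinedLetter);         // one grapheme on screen
console.log(combinedLetter.length);  // 2, still two code points underneath

// The precomposed 'ó' (U+00F3) is a different sequence until both are normalized.
console.log(combinedLetter === '\u00F3');                   // false
console.log(combinedLetter.normalize('NFC') === '\u00F3');  // true
```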
Encoding UTF-16
JavaScript strings are made of UTF-16 code units. And now that we know the terminology around Unicode, let’s introduce surrogate pairs.
UTF-16 uses 16-bit code units and encodes the code points in this way:
A code point in the BMP is stored in one code unit, because 16 bits can hold 65,536 distinct values, which is exactly the size of a plane.
For the remaining planes (astral planes), the code points range from 0x10000 to 0x10FFFF. After subtracting 0x10000, the remaining value fits in exactly 20 bits.
Since we can’t store those 20 bits in a single 16-bit code unit, there is a surrogate mechanism that splits them into two halves of 10 bits and stores each code point in two 16-bit code units as follows:
The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF.
The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF.
For example
Code point 0x0061 ‘a’ is in the BMP and can be represented by a single UTF-16 code unit: 0x0061.
Code point 0x1F970 ‘🥰’ is in an astral plane and represented by two code units: 0xD83E and 0xDD70.
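The arithmetic behind the surrogate mechanism can be sketched as follows. `toSurrogatePair` is a hypothetical helper written for illustration; in real code, `String.fromCodePoint` does this for you.

```javascript
// Sketch: split an astral code point into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper, not a built-in.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only astral code points need a surrogate pair');
  }
  const offset = codePoint - 0x10000;             // a 20-bit value
  const highSurrogate = 0xD800 + (offset >> 10);  // top 10 bits
  const lowSurrogate = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [highSurrogate, lowSurrogate];
}

const [highUnit, lowUnit] = toSurrogatePair(0x1F970);
console.log(highUnit.toString(16), lowUnit.toString(16));  // d83e dd70

// The two code units together form the same string as the code point escape.
console.log(String.fromCharCode(highUnit, lowUnit) === '\u{1F970}');  // true
```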
The solution to the trivia!
If you’re still here, now we have all the information needed to explain the trivia!
Let’s start by reviewing the length of the emoji 👨👩👦
Then we get the hex code points for each character
But wait a second, why are we getting 5 elements if the length of the emoji is 8?
Remember that astral plane code points are encoded using two 16-bit code units.
The elements at indexes 0, 2, and 4 are astral code points, so they take 2 code units each, 6 in total. The elements at indexes 1 and 3 are in the BMP and take 1 code unit each, which gives us a total of 8!
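The counting above can be reproduced with a short sketch. The emoji is written with escape sequences so that no invisible character gets lost when copying it, and the exact sequence (man, joiner, woman, joiner, boy) is an assumption based on the code points discussed in this section.

```javascript
// Sketch: the family emoji written out as escapes (assumed sequence).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// .length counts UTF-16 code units.
console.log(family.length);  // 8

// Spreading a string iterates by code point, not by code unit.
const familyCodePoints = [...family].map((ch) => ch.codePointAt(0).toString(16));
console.log(familyCodePoints);  // [ '1f468', '200d', '1f469', '200d', '1f466' ]

// Code units taken by each code point: 2 + 1 + 2 + 1 + 2 = 8.
const unitsPerCodePoint = [...family].map((ch) => ch.length);
console.log(unitsPerCodePoint);  // [ 2, 1, 2, 1, 2 ]
```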
Now let’s view what each code point represents
And because we know about combining marks we know it’s possible to build something like
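But maybe you’re still wondering what that 200d is about. That character is called Zero Width Joiner (U+200D), an invisible character that tells the renderer to join the characters around it into a single glyph when the font supports the combination. And this brings us to some curiosities and final thoughts about the majestic world of Unicode.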
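As a sketch of how the joiner works, the family emoji can be assembled from its members and taken apart again. The variable names are illustrative, and whether the result is drawn as one glyph depends on the font and platform.

```javascript
// Sketch: build the family emoji from its parts with the Zero Width Joiner.
const man = '\u{1F468}';
const woman = '\u{1F469}';
const boy = '\u{1F466}';
const zwj = '\u200D';  // ZERO WIDTH JOINER

const joinedFamily = [man, woman, boy].join(zwj);
console.log(joinedFamily);  // drawn as one family glyph where supported

// It is the same string as the one written out code point by code point.
console.log(joinedFamily === '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}');  // true

// Splitting on the joiner gives the three members back.
const members = joinedFamily.split(zwj);
console.log(members.length);  // 3
```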
Curiosities and final thoughts
There are characters like the Zero Width Joiner and other zero-width characters that can cause you a lot of trouble when you copy and paste things from the internet.
Regardless of whether you work with a high-level programming language or a low-level one, knowing how Unicode works is fundamental, since it’s used by everyone.
In JavaScript, characters from the astral planes have a length of 2, for example:
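A minimal sketch of this, using the 🥰 emoji (U+1F970) written as an escape:

```javascript
// Sketch: an astral character occupies two UTF-16 code units.
const smilingFace = '\u{1F970}';

console.log(smilingFace.length);       // 2 (code units)
console.log([...smilingFace].length);  // 1 (code points)

// The two code units are the surrogates.
const firstUnit = smilingFace.charCodeAt(0).toString(16);
const secondUnit = smilingFace.charCodeAt(1).toString(16);
console.log(firstUnit, secondUnit);    // d83e dd70
```

Counting with the spread operator gives the number of code points, which is usually closer to what a user would call the length.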
In applications like Twitter, when the 140-character limit existed, astral code points could cause unexpected problems in terms of space, since the user doesn’t care about how many code units a character takes.
In JavaScript, you can use either the surrogates or the code point to represent a given character, for example:
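A short sketch of the different notations, again with the 🥰 emoji (U+1F970):

```javascript
// Sketch: three ways to write the same astral character.
const viaSurrogates = '\uD83E\uDD70';               // two surrogate escapes
const viaCodePoint = '\u{1F970}';                   // one code point escape
const viaFunction = String.fromCodePoint(0x1F970);  // built from the number

console.log(viaSurrogates === viaCodePoint);  // true
console.log(viaCodePoint === viaFunction);    // true
```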
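Another example of characters that combine is the use of skin tones: a skin tone modifier (U+1F3FB through U+1F3FF) placed after a base emoji changes how that emoji is drawn.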
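As an illustration, here is a sketch that applies a skin tone to the thumbs up emoji. The choice of emoji and modifier is arbitrary, and how the result is drawn depends on the platform.

```javascript
// Sketch: a base emoji followed by a skin tone modifier.
const thumbsUp = '\u{1F44D}';        // THUMBS UP SIGN
const mediumSkinTone = '\u{1F3FD}';  // EMOJI MODIFIER FITZPATRICK TYPE-4
const tonedThumbsUp = thumbsUp + mediumSkinTone;

console.log(tonedThumbsUp);              // drawn as one emoji where supported
console.log(tonedThumbsUp.length);       // 4 (two astral code points, two code units each)
console.log([...tonedThumbsUp].length);  // 2 (code points)
```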
I hope you enjoyed the article. The next time you are working on some code where two texts have the EXACT same characters but aren’t equal when compared, remember that Unicode has some crazy things like Zero Width characters, where a hidden character could be messing everything up.
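As a parting sketch of that last scenario, the snippet below compares two strings that look identical and then removes a handful of zero-width characters. `stripZeroWidth` is a hypothetical helper, and the list of characters it removes is an assumption, not a complete one.

```javascript
// Sketch: two strings that look the same on screen but are not equal.
const visibleText = 'unicode';
const textWithHidden = 'uni\u200Bcode';  // contains a ZERO WIDTH SPACE

console.log(visibleText === textWithHidden);             // false
console.log(visibleText.length, textWithHidden.length);  // 7 8

// stripZeroWidth is a hypothetical helper; the character list is illustrative.
// Note that removing U+200D also breaks joined emojis like the family one.
function stripZeroWidth(text) {
  return text.replace(/[\u200B\u200C\u200D\uFEFF]/g, '');
}

console.log(stripZeroWidth(textWithHidden) === visibleText);  // true
```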