
Added a bit on variation selector codepoints and fixed an incorrect codepoint in the airplane example Change-Id: I849ad13b4408c8e3f2a0ff60aebe697ae8e30d39 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4220143 Reviewed-by: Elaine Chien <elainechien@chromium.org> Commit-Queue: David Yeung <dayeung@chromium.org> Cr-Commit-Position: refs/heads/main@{#1101802}
# Unicode Overview

This document goes over general concepts of Unicode and text rendering.

## **Breakdown**
This is a general overview of how text gets transformed from raw bytes to a
glyph on the screen. The document is geared toward explaining niche concepts
while also pointing out some pitfalls to avoid.

Chrome deals with Unicode, so this document assumes text will be rendered as
Unicode characters. Chrome uses [ICU](https://icu.unicode.org/), an open source
set of libraries that provides Unicode support for many different applications.

This doc will be going over 4 stages.
1. Unicode codepoint encodings
2. Codepoints
3. Graphemes
4. Glyphs

## **Unicode codepoint encodings**

Binary and codepoint encodings are CS concepts with a lot of online resources.
You can search for any online resource to get more familiar with them.
- [Unicode Technical Reports](https://www.unicode.org/reports/)

Codepoint encoding is how software converts codepoints to bytes, while
codepoint decoding transforms bytes back to codepoints. The most commonly used
encodings are UTF-8 and UTF-16.

Imagine this as a mapping between binary and codepoints.
- User inputs a string while specifying the codepoint encoding. (Note: Generally
  this falls back to the default encoding scheme)
- Program will transform the string into binary and store it.
- When the program needs to use the string again, it fetches the binary and
  decodes it back to the string with the encoding scheme.

The main difference between encoding schemes is how much memory is used for
each codepoint.
- `UTF-8` - This is a variable length encoding scheme built from 8 bit units.
  It saves the most memory for the first 128 codepoints (U+0000 to U+007F, a
  single byte each) and grows up to 4 bytes depending on the codepoint.
- `UTF-16` - This is a variable length encoding scheme built from 16 bit units.
  Codepoints up to U+FFFF take one unit (2 bytes) and all others take two units
  (4 bytes, known as a surrogate pair). It can represent every codepoint while
  using less memory than UTF-32 for the most commonly used ones.
- `UTF-32` - This is a fixed width encoding of 32 bits that can represent all
  codepoints. The tradeoff is that it uses the most memory per codepoint.

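To make the variable length of UTF-8 concrete, here is a minimal sketch of an
encoder for a single codepoint. `EncodeUtf8` is a hypothetical helper written
for this document, not a Chromium or ICU API, and real code should use the
existing conversion utilities instead. With it, `U+0041` encodes to the single
byte `41`, while `U+1D11E` encodes to the four bytes `F0 9D 84 9E`.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode codepoint as UTF-8 (1 to 4 bytes).
// Hypothetical helper for illustration; not a Chromium or ICU API.
std::string EncodeUtf8(char32_t cp) {
  // Surrogates (U+D800..U+DFFF) and values above U+10FFFF cannot be encoded.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::string out;
  if (cp < 0x80) {  // 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {  // 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```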
For example, the letter `A` (`U+0041`) is represented as follows in each
encoding:
```
Binary 01000001 => Codepoint (U+0041).
UTF-8:
41
std::string str = "\x41";
UTF-16:
0041
std::u16string str = u"\x41";
UTF-32:
00000041
std::u32string str = U"\U00000041";
```
For characters that require multiple bytes, like the treble clef `𝄞`
(`U+1D11E`), the encoding changes depending on the scheme.
```
Binary (UTF-8)
11110000 10011101 10000100 10011110
UTF-8:
F0 9D 84 9E
std::string str = "\xF0\x9D\x84\x9E";
UTF-16:
D834 DD1E
std::u16string str = u"\xD834\xDD1E";
UTF-32:
0001D11E
std::u32string str = U"\U0001D11E";
```

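The UTF-16 pair `D834 DD1E` in the treble clef example comes from simple
arithmetic on the codepoint. The sketch below shows that arithmetic;
`EncodeUtf16` is a hypothetical helper used only for illustration, not an
existing Chromium or ICU function.

```cpp
#include <cassert>
#include <string>

// Encodes one codepoint as UTF-16: a single code unit up to U+FFFF,
// otherwise a high/low surrogate pair.
// Hypothetical helper for illustration only.
std::u16string EncodeUtf16(char32_t cp) {
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::u16string out;
  if (cp < 0x10000) {
    out += static_cast<char16_t>(cp);
  } else {
    const char32_t v = cp - 0x10000;  // 20 significant bits remain
    out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
    out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
  }
  return out;
}
```

For `U+1D11E`, subtracting `0x10000` leaves `0xD11E`. Its top ten bits (`0x34`)
are added to `0xD800` and its bottom ten bits (`0x11E`) are added to `0xDC00`,
which gives `D834 DD1E`.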
## **Codepoints**

A codepoint is a unique number that represents some type of character or
control information. Codepoints also have
[properties/attributes](https://unicode-org.github.io/icu/userguide/strings/properties.html)
that describe how to process and render them (e.g. BiDi, Block, Script, etc).

For example, `U+0041` is the codepoint that represents the letter `A`. Unicode
has a large library of codepoints that covers characters from different
languages, symbols used in pronunciation, and even emojis.

Some codepoints can affect their surrounding characters.
For example, diacritical codepoints such as the "combining acute accent"
(`U+0301`) are used to add an accent to the preceding character.
```
◌́
```
Some codepoints do not map to a displayed character.
ZWJ (`U+200D`) is the zero width joiner, a codepoint that joins the codepoints
on either side of it. Left-To-Right Embedding (`U+202A`) is a codepoint that
forces the text that follows to be interpreted as left-to-right.

Variation selectors are another set of codepoints that are not displayed on
their own. They affect the presentation of the preceding character. For
example, an emoji + `U+FE0E` requests the text display while an emoji + `U+FE0F`
requests the colored display. If you do not specify a variation, the shaping
engine will just pick the default glyph in the font.

```
U+2708 maps to an airplane: ✈

Adding a variation selector (U+FE0E or U+FE0F) will affect the way the emoji is
displayed

U+2708 U+FE0E = ✈︎
U+2708 U+FE0F = ✈️
```

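As a rough illustration of how a variation selector attaches to the codepoint
before it, the sketch below inspects a decoded (UTF-32) string and reports
which presentation was requested. `Presentation` and `RequestedPresentation`
are hypothetical names invented for this example; they are not part of
Chromium or ICU, and only the two selectors discussed above are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Presentation { kDefault, kText, kEmoji };

// Reports which presentation the codepoint at `index` requests, based on
// the codepoint immediately following it.
// Hypothetical helper; only U+FE0E and U+FE0F are recognized.
Presentation RequestedPresentation(const std::u32string& text,
                                   std::size_t index) {
  assert(index < text.size());
  if (index + 1 == text.size()) return Presentation::kDefault;
  switch (text[index + 1]) {
    case 0xFE0E:
      return Presentation::kText;
    case 0xFE0F:
      return Presentation::kEmoji;
    default:
      return Presentation::kDefault;
  }
}
```

Calling it on `U"\u2708\uFE0E"` at index 0 reports `kText`, and on
`U"\u2708"` it reports `kDefault`, which leaves the choice to the font.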
## **Graphemes**

A grapheme is a sequence of one or more codepoints that is displayed as a
single character. For example, “e” and “é” are both graphemes. “e” is a single
codepoint while “é” can be either a single codepoint or multiple codepoints
depending on how it’s encoded.

```
é (U+00E9)
```
or
```
“e” + “◌́”
(U+0065) + (U+0301)
```
Emojis are another example of grapheme clusters that can be combinations of
multiple codepoints.

```
👨‍✈️ is actually a combination of
👨 Man (U+1F468) +
Zero Width Joiner (U+200D) +
✈️ Airplane (U+2708)
```
**Note: Graphemes are not breakable!**
Because codepoints such as diacritics and joiners attach other codepoints to a
grapheme, many codepoints can make up a single grapheme.

For example, a grapheme can consist of:
- 1x codepoint: codepoint
- 2x codepoint: base + diacritic
- 3x codepoint: codepoint + joiner + codepoint [+ presentation]
- Nx codepoint: codepoint + joiner + codepoint + joiner + codepoint + etc

Luckily we do not need to implement all of the various combinations or
understand what makes a valid grapheme. Chrome relies on the ICU library to
iterate through graphemes.

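To show why several codepoints must travel together, here is a deliberately
simplified sketch that finds where a cluster ends in a decoded (UTF-32) string.
It only knows about three of the cases mentioned in this document (combining
diacritical marks in `U+0300..U+036F`, the variation selectors
`U+FE00..U+FE0F`, and the zero width joiner). `ExtendsPrevious` and
`ToyClusterEnd` are hypothetical names, and this toy is not a substitute for
ICU, which implements the complete segmentation rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy rule: does `cp` attach to the codepoint before it?
// Real segmentation has many more rules; use ICU in production code.
constexpr bool ExtendsPrevious(char32_t cp) {
  return (cp >= 0x0300 && cp <= 0x036F) ||  // combining diacritical marks
         (cp >= 0xFE00 && cp <= 0xFE0F) ||  // variation selectors
         cp == 0x200D;                      // zero width joiner
}

// Returns the index one past the cluster that begins at `start`,
// using only the toy rules above.
std::size_t ToyClusterEnd(const std::u32string& text, std::size_t start) {
  assert(start < text.size());
  std::size_t i = start + 1;
  while (i < text.size()) {
    if (ExtendsPrevious(text[i])) {
      ++i;  // a mark, selector, or joiner sticks to what came before
    } else if (text[i - 1] == 0x200D) {
      ++i;  // the codepoint after a joiner belongs to the same cluster
    } else {
      break;
    }
  }
  return i;
}
```

For the pilot emoji `U+1F468 U+200D U+2708 U+FE0F`, starting at index 0, the
sketch returns 4: all four codepoints form one unit that must not be split.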
## **Glyphs**

A glyph is a set of graphic primitives painted to the screen to represent a
grapheme. Fonts are pre-made mappings between characters and glyphs. The pixels
drawn on the screen depend on the loaded font, which maps graphemes to glyphs.
Fonts can be user controlled, so any type of image can be displayed depending on
what is mapped to the grapheme.

Note that a 1:1 mapping between graphemes and glyphs is not guaranteed; it is
more of an N:M mapping.

Translating graphemes to glyphs is a multi-step process that will be covered
here: (TODO - add link to RenderText doc).

## **Pitfalls:**
### ***Rule: Do not break up graphemes!***
Modifying an encoded string is hard and carries more risk than you think.

If you are trying to truncate a string to a length of 3, you might try something
like:
```
std::string new_string = original_string.substr(0, 3);
```
But that is incorrect!

Let's imagine `original_string` is actually an emoji and you attempt to truncate
the string to a length of 3.
```
std::string original_string = "😔";  // Encoded as "\xF0\x9F\x98\x94", length = 4
std::string new_string = original_string.substr(0, 3);
```
Well, what's wrong here? You might think this is fine since there is only 1
displayed character in the emoji string, but that is incorrect!
`"😔" maps to "F0 9F 98 94"`

The length of the string is actually 4 bytes. Taking a substring of length 3
corrupts the data by removing the last byte of the emoji's only codepoint.
`F0 9F 98` ~~`94`~~

This leaves a corrupted string which does not map to a valid Unicode character.

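The corruption above can be detected mechanically. The following sketch checks
whether a UTF-8 byte string stops partway through a multi-byte sequence.
`Utf8SequenceLength`, `EndsMidSequence`, and `DemonstrateTruncation` are
hypothetical names for illustration only; this is not a full UTF-8 validator,
and a string that passes the check can still have a grapheme split in half.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes a UTF-8 sequence should have, judged by its lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
std::size_t Utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

// True if `bytes` ends partway through a multi-byte sequence.
bool EndsMidSequence(const std::string& bytes) {
  // Walk back over at most three trailing continuation bytes (10xxxxxx).
  std::size_t i = bytes.size();
  std::size_t trailing = 0;
  while (i > 0 && trailing < 3 &&
         (static_cast<unsigned char>(bytes[i - 1]) & 0xC0) == 0x80) {
    --i;
    ++trailing;
  }
  if (i == 0) return trailing > 0;  // continuation bytes with no lead byte
  const std::size_t expected =
      Utf8SequenceLength(static_cast<unsigned char>(bytes[i - 1]));
  return expected > trailing + 1;
}

// Replays the truncation pitfall described in the text.
void DemonstrateTruncation() {
  const std::string emoji = "\xF0\x9F\x98\x94";  // U+1F614, four bytes
  assert(!EndsMidSequence(emoji));
  assert(EndsMidSequence(emoji.substr(0, 3)));  // F0 9F 98: last byte lost
}
```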
Developers might also attempt to highlight only a section of the codepoints in
a string.

`SetColor(Yellow, Range(1...3));`

But this is also incorrect depending on which codepoints are highlighted.
Similar to the previous example, if the range covers only some of the
codepoints within a grapheme, this will break!

It is impossible to know what color to use for that glyph because the range
does not encompass the entire grapheme.

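One way to avoid a partial highlight is to widen the range until both ends sit
on grapheme boundaries. The sketch below assumes the boundary offsets have
already been computed elsewhere (for example, with ICU) and are passed in as a
sorted list that starts at 0 and ends at the text length. `SnapToGraphemes` is
a hypothetical helper, not an existing `gfx::` function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Expands the half-open range [start, end) so that both ends fall on
// grapheme boundaries. `boundaries` is assumed to be sorted, to begin
// with 0, and to end with the length of the text.
std::pair<std::size_t, std::size_t> SnapToGraphemes(
    const std::vector<std::size_t>& boundaries,
    std::size_t start,
    std::size_t end) {
  assert(!boundaries.empty() && boundaries.front() == 0);
  assert(start <= end && end <= boundaries.back());
  // Last boundary that is <= start.
  auto lo = std::upper_bound(boundaries.begin(), boundaries.end(), start);
  --lo;
  // First boundary that is >= end.
  auto hi = std::lower_bound(boundaries.begin(), boundaries.end(), end);
  return {*lo, *hi};
}
```

Take the UTF-16 text "a", the pilot emoji (5 code units), then "b". Its
boundaries are `{0, 1, 6, 7}`. A requested range of `[2, 4)` lies inside the
emoji and is widened to `[1, 6)`, so the whole grapheme gets a single color.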
## **Recommendations:**

### Avoid using custom text modifications

The crux of this document is to highlight the difficulties of Unicode. Before
trying to add custom string modifiers, look at the `gfx::` namespace to see if
the functionality already exists. If it is not covered, please reach out to the
owners and consider adding the new functionality to `gfx::`.