UNICODE Basics: What's Character Encoding, UTF-8, and All That?

Perm url with updates: http://xahlee.org/emacs/unicode_basics.html

Xah Lee, 2010-06-20, 2010-08-25

What's Character Encoding?

Any text file has to go through encoding and decoding in order to be displayed or written properly. Suppose your language is Chinese (or Japanese, Russian, Arabic, etc.). Your computer needs a way to translate the characters of your language's writing system into a sequence of 1s and 0s. This transformation is called character encoding.
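As a small illustration, here is a sketch in Python (picked only for convenience) that encodes one Chinese character with UTF-8 and shows the resulting 1s and 0s. The helper name `to_bits` is made up for this example.

```python
def to_bits(text, encoding="utf-8"):
    """Return the encoded bytes of `text` as a string of 1s and 0s."""
    data = text.encode(encoding)  # str -> bytes
    # Show each byte as 8 binary digits, separated by spaces.
    return " ".join(format(byte, "08b") for byte in data)


bits = to_bits("中")
print(bits)          # 11100100 10111000 10101101
print(to_bits("A"))  # 01000001
```

The same character gives a different bit sequence under a different encoding, which is why the reader of a file has to know which encoding the writer used.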

There are many encoding systems. The most popular are ASCII, UTF-8 (the usual default on Linux), UTF-16 (used by the file systems of Windows and OS X), and GB 18030 (used in China; it covers all Unicode characters).

What's a Character Set?

An encoding system also defines a character set, implicitly or explicitly. A character set is a fixed collection of symbols. An encoding system has to define one because it must specify which characters (symbols) it is designed to handle.

For example, ASCII is designed for the English (basic Latin) alphabet, and its set also includes digits and punctuation symbols. ASCII cannot be used for the Arabic alphabet, the Cyrillic (Russian) alphabet, Chinese characters, etc. Nor can ASCII be used for European languages that have characters such as è é å ø ü.
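The limit of ASCII's character set is easy to see in practice. The following Python sketch tries to encode a few strings as ASCII; the function name `can_encode_ascii` is hypothetical, made up for this illustration.

```python
def can_encode_ascii(text):
    """Return True if every character of `text` is in ASCII's character set."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised when a character has no representation in the target encoding.
        return False


print(can_encode_ascii("hello, world 123"))  # True
print(can_encode_ascii("\u00e9"))            # False (é, e with acute accent)
print(can_encode_ascii("中文"))               # False
```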

Unicode's Character Set, Code Point, and Encoding Systems

Unicode defines a character set first, then gives each character a unique ID. This ID is just an integer, and is called the character's “code point”.
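To make the idea of a character's integer ID concrete, here is a short Python sketch. Python's built-in `ord` returns the integer for a character and `chr` goes the other way; the helper `code_point_label` is a made-up name that formats the integer in the customary U+XXXX notation.

```python
def code_point_label(ch):
    """Format a single character's integer ID in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "\u00e9", "中"]:  # "\u00e9" is é
    print(ch, ord(ch), code_point_label(ch))

# The mapping works in both directions.
print(chr(0x4E2D))  # 中
```

Note that this integer is independent of any encoding: it is the same whether the text is later stored as UTF-8, UTF-16 or UTF-32.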

Then, it defines several encoding systems (ways to map a given code point to a sequence of bytes). The most popular are UTF-8 and UTF-16. Each encoding system is suitable for different purposes. UTF-8 is suitable for texts that are mostly Latin alphabet letters, numbers and punctuation symbols (e.g. most Linux systems use UTF-8).

UTF-8 is backwards compatible with ASCII. If your text is just ASCII, then encoding it with UTF-8 results in the same byte sequence as ASCII. Characters outside ASCII take 2 to 4 bytes each. With UTF-16, the most commonly used characters in Unicode take 2 bytes, and the rest take 4 bytes. For Asian languages, or texts that are mostly non-Latin characters, UTF-16 is often more compact: Chinese characters, for example, typically take 3 bytes each in UTF-8 but 2 bytes each in UTF-16.
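The size difference can be checked directly. The Python sketch below counts bytes for a short English string and a short Chinese string. The helper name `encoded_size` and the sample strings are made up for the example, and `utf-16-le` is used so that no byte order mark is added to the count.

```python
def encoded_size(text, encoding):
    """Number of bytes `text` occupies when stored in `encoding`."""
    return len(text.encode(encoding))


english = "hello"    # 5 ASCII characters
chinese = "中文字符"  # 4 Chinese characters

print(encoded_size(english, "utf-8"))      # 5
print(encoded_size(english, "utf-16-le"))  # 10
print(encoded_size(chinese, "utf-8"))      # 12
print(encoded_size(chinese, "utf-16-le"))  # 8

# For pure ASCII text, UTF-8 and ASCII give identical bytes.
print(english.encode("utf-8") == english.encode("ascii"))  # True
```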

There's also UTF-32, which always uses 4 bytes per character. It produces larger files, but is simpler to parse. UTF-32 is currently not much used.
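A brief Python sketch of why fixed width makes parsing simple: with 4 bytes per character, the n-th character always starts at byte 4 × n, so it can be found without scanning the text before it. The helper `char_at` is a hypothetical name for this illustration, and `utf-32-le` is used to leave out the byte order mark.

```python
def char_at(data, n):
    """Return the n-th character (0-based) of UTF-32-LE encoded bytes."""
    # Fixed width: character n occupies bytes 4n .. 4n+3.
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


text = "a\u00e9中"  # three characters: a, é, 中
data = text.encode("utf-32-le")

print(len(data))         # 12
print(char_at(data, 2))  # 中
```

With UTF-8 or UTF-16, characters vary in width, so finding the n-th character in general means walking through the bytes from the start.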

Decoding

When an editor opens a file, it needs to know the encoding system used, in order to decode the byte stream and map it to font glyphs that display the original characters properly. In general, the info about the encoding system used for a file is not bundled with the file.

Before the internet, there was little problem, since most of the English-speaking world used ASCII, and non-English regions used encoding schemes particular to their region.

With the internet, files in different languages started to be exchanged a lot. When opening a file, Windows applications may try to guess the encoding system used, by some heuristics. On Linux, this is done less often; programs commonly just assume one encoding (typically UTF-8). When a file is opened in an app that assumes the wrong encoding, the result is typically gibberish. Usually, you can explicitly tell an app to use a particular encoding to open the file. (e.g. in web browsers, there's usually a menu. In Firefox, it is under View, Character Encoding.) Similarly, when writing to a file, there's usually an option for you to specify what encoding to use.
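The gibberish effect is easy to reproduce. In this Python sketch, the same bytes are decoded twice, once with a wrong guess (Latin-1, chosen arbitrarily for the example) and once with the encoding that was actually used. The variable names are made up for illustration.

```python
original = "中文"
raw = original.encode("utf-8")  # the bytes as they would sit in a file

# Wrong guess: Latin-1 accepts any byte, so there is no error, only gibberish.
wrong = raw.decode("latin-1")

# Correct encoding: the original text comes back.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)  # 中文
```

In Python, the same explicit choice is made for files with the `encoding` parameter of the built-in `open`, for both reading and writing.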

Fonts

For Asian languages such as Chinese, Japanese, Korean, or languages that use the Arabic alphabet as their writing system (Arabic, Persian), you also need a proper font to display the file correctly.

See: Best Fonts for Unicode.

Input Method

For languages whose writing is not based on an alphabet, such as Chinese and Japanese, you need an input method to type them.
